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Introduction 

This collection of modules is from a Rice University, ECE Department 
Technical Report written around September 1994. It grew out of the 
doctoral and post doctoral research of Ivan Selesnick working with Prof. C. 
Sidney Burrus at Rice. Earlier reports on this work were published in the 
ICASSP and ISCAS conference proceedings in 1992-94 and a fairly 
complete report was published in the IEEE Transaction on Signal 
Processing in January 1996. 


Introduction 


The development of algorithms for the fast computation of the Discrete 
Fourier Transform in the last 30 years originated with the radix 2 Cooley- 
Tukey FFT and the theory and variety of FFTs has grown significantly since 
then. Most of the work has focused on FFTs whose sizes are composite, for 
the algorithms depend on the ability to factor the length of the data 
sequence, so that the transform can be found by taking the transform of 
smaller lengths. For this reason, algorithms for prime length transforms are 
building blocks for many composite length FFTs - the maximum length and 
the variety of lengths of a PFA or WFTA algorithm depend upon the 
availability of prime length FFT modules. As such, prime length Fast 
Fourier Transforms are a special, important and difficult case. 


Fast algorithms designed for specific short prime lengths have been 
developed and have been written as straight line code [link], [link]. These 
dedicated programs rely upon an observation made in Rader's paper [link] 
in which he shows that a prime p length DFT can be found by performing a 
p — 1 length circular convolution. Since the publication of that paper, 
Winograd had developed a theory of multiplicative complexity for 
transforms and designed algorithms for convolution that attain the 
minimum number of multiplications [link]. Although Winograd's algorithms 
are very efficient for small prime lengths, for longer lengths they require a 
large number of additions and the algorithms become very cumbersome to 
design. This has prevented the design of useful prime length FFT programs 
for lengths greater than 31. Although we have previously reported the 
design of programs for prime lengths greater than 31 [link] those programs 
required more additions than necessary and were long. Like the previously 


existing ones, they simply consisted of a long list of instructions and did not 
take advantage of the attainable common structures. 


In this paper we describe a set of programs for circular convolution and 
prime length FFTs that are are short, possess great structure, share many 
computational procedures, and cover a large variety of lengths. Because the 
underlying convolution is decomposed into a set of disjoint operations they 
can be performed in parallel and this parallelism is made clear in the 
programs. Moreover, each of these independent operations is made up of a 
sequence of sub-operations of the form J @ A ® I where ® denotes the 
Kronecker product. These operations can be implemented as vector/parallel 
operations [link]. Previous programs for prime length FFTs do not have 
these features: they consist of straight line code and are not amenable to 
vector/parallel implementations. 


We have also developed a program that automatically generates these 
programs for circular convolution and prime length DFTs. This code 
generating program requires information only about a set of modules for 
computing cyclotomic convolutions. We compute these non-circular 
convolutions by computing a linear convolution and reducing the result. 
Furthermore, because these linear convolution algorithms can be built from 
smaller ones, the only modules needed are ones for the linear convolution 
of prime length sequences. It turns out that with linear convolution 
algorithms for only the lengths 2 and 3, we can generate a wide variety of 
prime length FFT algorithms. In addition, the code we generate is made up 
of calls to a relatively small set of functions. Accordingly, the subroutines 
can be designed and optimized to specifically suit a given architecture. 


The programs we describe use Rader's conversion of a prime point DFT 
into a circular convolution, but this convolution we compute using the split 
nesting algorithm [link]. As Stasinski notes [link], this yields algorithms 
possessing greater structure and simpler programs and doesn't generally 
require more computation. 


On the Row-Column Method 


In computing the DFT of an n = n4nz2 point sequence where n, and nz are 
relatively prime, a row-column method can be employed. That is, if an 

m1 X Ng array is appropriately formed from the n point sequence, then its 
DFT can be computed by computing the DFT of the rows and by then 
computing the DFT of the columns. The separability of the DFT makes this 
possible. It should be mentioned, however, that in at least two papers [link], 
[link] it is mistakenly assumed that the row-column method can also be 
applied to convolution. Unfortunately, the convolution of two sequences 
can not be found by forming two arrays, by convolving their rows, and by 
then convolving their columns. This misunderstanding about the 
separability of convolution also appears in [link] where the author 
incorrectly writes a diagonal matrix of a bilinear form as a Kronecker 
product. If it were a Kronecker product, then there would indeed exist a 
row-column method for convolution. 


Earlier reports on this work were published in the conference proceedings 
[link], [link], [link] and a fairly complete report was published in the IEEE 
Transaction on Signal Processing [link]. Some parts of this approach appear 
in the Connexions book, Fast Fourier Transforms. This work is built on and 
an extension of that in [link] which is also in the Connexions Technical 
Report. 


Preliminaries 

This collection of modules is from a Rice University, ECE Department 
Technical Report written around September 1994. It grew out of the doctoral 
and post doctoral research of Ivan Selesnick working with Prof. C. Sidney 
Burrus at Rice. Earlier reports on this work were published in the ICASSP 
and ISCAS conference proceedings in 1992-94 and a fairly complete report 
was published in the IEEE Transaction on Signal Processing in January 1996. 


Preliminaries 


Because we compute prime point DFTs by converting them in to circular 
convolutions, most of this and the next section is devoted to an explanation 
of the split nesting convolution algorithm. In this section we introduce the 
various operations needed to carry out the split nesting algorithm. In 
particular, we describe the prime factor permutation that is used to convert a 
one-dimensional circular convolution into a multi-dimensional one. We also 
discuss the reduction operations needed when the Chinese Remainder 
Theorem for polynomials is used in the computation of convolution. The 
reduction operations needed for the split nesting algorithm are particularly 
well organized. We give an explicit matrix description of the reduction 
operations and give a program that implements the action of these reduction 
operations. 


The presentation relies upon the notions of similarity transformations, 
companion matrices and Kronecker products. With them, we describe the 
split nesting algorithm in a manner that brings out its structure. We find that 
when companion matrices are used to describe convolution, the reduction 
operations block diagonalizes the circular shift matrix. 


The companion matrix of a monic polynomial, 
M (s) = mo +mjs +--+ +mMp_18" 1 + 8” is given by 
Equation: 


Cu = 


1 —™Mn—-1 


Its usefulness in the following discussion comes from the following relation 
which permits a matrix formulation of convolution. Let 


Equation: 
X(s) = eo +418+4+-- Gas 
H(s) = hothis+-::hn-18"— 
Y(s) = yotyis+-: Yn18" 
M(s) = mo+mis+--: ty, a” ee” 
Then 
Equation: 
n—-1 
Y (s) = ( (x10) bu & y= (S chy) 
k=0 
where y = (yo, -°° aay e = (ga; aa. and Cy is the companion 


matrix of M(s). In [link], we say y is the convolution of z and h with 
respect to M(s). In the case of circular convolution, M (s) = s” — 1 and 
C',n_1 is the circular shift matrix denoted by S,,, 

Equation: 


Notice that any circulant matrix can be written as >, hyS*. 


Similarity transformations can be used to interpret the action of some 
convolution algorithms. If Cyy = T'~! AT for some matrix T' (Cy and A are 
similar, denoted Cy ~ A), then [link] becomes 

Equation: 


That is, by employing the similarity transformation given by 7’ in this way, 
the action of Sk is replaced by that of A*. Many circular convolution 
algorithms can be understood, in part, by understanding the manipulations 
made to S,, and the resulting new matrix A. If the transformation T is to be 
useful, it must satisfy two requirements: (1) 7’z must be simple to compute, 
and (2) A must have some advantageous structure. For example, by the 
convolution property of the DFT, the DFT matrix F' diagonalizes S,,, 
Equation: 


so that it diagonalizes every circulant matrix. In this case, T’z can be 
computed by using an FFT and the structure of A is the simplest possible. So 
the two above mentioned conditions are met. 


The Winograd Structure can be described in this manner also. Suppose 
M(s) can be factored as M (s) = M, (s)Mz2 (s) where M, and M, have no 
common roots, then Cy ~ (Cz, ® Cu,) where @ denotes the matrix direct 
sum. Using this similarity and recalling [link], the original convolution is 
decomposed into disjoint convolutions. This is, in fact, a statement of the 
Chinese Remainder Theorem for polynomials expressed in matrix notation. 


In the case of circular convolution, s” — 1 = [[ d|n ®q(s), so that S,, can be 
transformed to a block diagonal matrix, 


Equation: 


Ce 


1 


where q (s) is the d’* cyclotomic polynomial. In this case, each block 
represents a convolution with respect to a cyclotomic polynomial, or a 
‘cyclotomic convolution’. Winograd's approach carries out these cyclotomic 
convolutions using the Toom-Cook algorithm. Note that for each divisor, d, 
of n there is a corresponding block on the diagonal of size y(d), for the 
degree of &g (s) is y(d) where y(-) is the Euler totient function. This 
method is good for short lengths, but as n increases the cyclotomic 
convolutions become cumbersome, for as the number of distinct prime 
divisors of d increases, the operation described by >>, hi(Ce,)* becomes 
more difficult to implement. 


The Agarwal-Cooley Algorithm utilizes the fact that 
Equation: 


Sn = P*(Sn, @ Sn,)P 


where n = 1N2, (N1, 2) = 1 and P is an appropriate permutation [link]. 
This converts the one dimensional circular convolution of length n to a two 
dimensional one of length n; along one dimension and length nz along the 
second. Then an n1-point and an n2-point circular convolution algorithm can 
be combined to obtain an n-point algorithm. In polynomial notation, the 
mapping accomplished by this permutation P can be informally indicated by 
Equation: 


Y (s) = ( X (s)H(s) Jor & Y (s,t) = ( X(5,t)H (6,0) bon ayes 


It should be noted that [link] implies that a circulant matrix of size nyn2 can 
be written as a block circulant matrix with circulant blocks. 


The Split-Nesting algorithm [link] combines the structures of the Winograd 
and Agarwal-Cooley methods, so that S,, is transformed to a block diagonal 
matrix as in [link], 

Equation: 


S, ~ @W(d). 
ee 


Here Y (d) = @pjapce~C%,,,,, Where Hq (p) is the highest power of p 
dividing d, and # is the set of primes. 


Example: 
Equation: 


S45 ~ 


In this structure a multidimensional cyclotomic convolution, represented by 
W(d), replaces each cyclotomic convolution in Winograd's algorithm 
(represented by Cg, in [link]. Indeed, if the product of b;,---, b; is d and 


they are pairwise relatively prime, then Cg, ~ Cg, ®--: ® Cg, . This 
gives a method for combining cyclotomic convolutions to compute a longer 
circular convolution. It is like the Agarwal-Cooley method but requires fewer 
additions [link]. 


Prime Factor Permutations 


One can obtain S,, ® Sp, from Sp,n, when (m1, 2) = 1, for in this case, 
S;, is similar to S;, ® Sn,, N = N1N2. Moreover, they are related by a 
permutation. This permutation is that of the prime factor FFT algorithms and 
is employed in nesting algorithms for circular convolution [Link], [link]. The 
permutation is described by Zalcstein [link], among others, and it is his 
description we draw on in the following. 


Let n = n1Nn2 where (ni, 2) = 1. Define ex, (0 < k < n — 1), to be the 
standard basis vector, (0,---,0,1,0,---, 0)’, where the 1 is in the k” 
position. Then, the circular shift matrix, S,,, can be described by 
Equation: 


Sek =e (k+1)n° 
Note that, by inspection, 
Equation: 
(5, &® Oa) Cand =e (a+1)n,+n1 (b+1)n, 
where 0 < a <n; —1land0O < b< ng — 1. Because n, and nz are 
relatively prime a permutation matrix P can be defined by 
Equation: 


POR = € (atm (B)ng’ 


With this P, 
Equation: 


PS rex = Pe (k+1)n 
= © ( (k+1)n)ay tm ( (k+D)n) ay 


= © (k4+1)n, tn (k+1)n, 


and 
Equation: 


2 


(Sn, @Sn,)Per = (Sn, ® Sn, )e (k)ny tm ()n 
e 


(k+1)n,+m1 (k+1)ny* 


Since PS;,e% = (Sn, ® Sn,)Pex and P~' = P*, one gets, in the multi- 
factor case, the following. 
Lemma 


Ifn = n,---nz and nj, ..., Nx are pairwise relatively prime, then 
S, = P* (S,, ®-++® S,,)P where P is the permutation matrix given by 


Pé.—e (k)ny tri (R)ngt striae (k) ng! 


This useful permutation will be denoted here as P,,, .....,- If 
n = pp; -- + p;* then this permutation yields the matrix, Sie Q-+-@ Spek: 


k 
This product can be written simply as ® S,,«:, so that one has 
i=1 *% 


k 
— pt 
De = Pr sscen ( 8,552 ) Posen 


— 


It is quite simple to show that 
Equation: 


Pa,b,c = (Ia © Pye) Page a (Pap © Te) Pitas 


In general, one has 
Equation: 


k 
Payne = I] Cyr ee ® Ligeti; )' 
5 


t 


A Matlab function for P,» ® I, is pfp2I() in one of the appendices. This 
program is a direct implementation of the definition. In a paper by Templeton 
[link], another method for implementing P,, without “if' statements, is 
given. That method requires some precalculations, however. A function for 
Pn, ,.-n, is PFP(). It uses [link] and calls pfp21() with the appropriate 
arguments. 


Reduction Operations 


The Chinese Remainder Theorem for polynomials can be used to decompose 
a convolution of two sequences (the polynomial product of two polynomials 
evaluated modulo a third polynomial) into smaller convolutions (smaller 
polynomial products) [link]. The Winograd n point circular convolution 
algorithm requires that polynomials are reduced modulo the cyclotomic 
polynomial factors of s” — 1, 6g (s) for each d dividing n. 


When n has several prime divisors the reduction operations become quite 
complicated and writing a program to implement them is difficult. However, 
when n is a prime power, the reduction operations are very structured and 
can be done in a straightforward manner. Therefore, by converting a one- 
dimensional convolution to a multi-dimensional one, in which the length is a 
prime power along each dimension, the split nesting algorithm avoids the 
need for complicated reductions operations. This is one advantage the split 
nesting algorithm has over the Winograd algorithm. 


By applying the reduction operations appropriately to the circular shift 
matrix, we are able to obtain a block diagonal form, just as in the Winograd 
convolution algorithm. However, in the split nesting algorithm, each diagonal 
block represents multi-dimensional cyclotomic convolution rather than a one- 
dimensional one. By forming multi-dimensional convolutions out of one- 
dimensional ones, it is possible to combine algorithms for smaller 
convolutions (see the next section). This is a second advantage split nesting 


algorithm has over the Winograd algorithm. The split nesting algorithm, 
however, generally uses more than the minimum number of multiplications. 


Below we give an explicit matrix description of the required reduction 
operations, give a program that implements them, and give a formula for the 
number of additions required. (No multiplications are needed.) 


First, consider n = p, a prime. Let 


Equation: 
X (s) =a +218 +--+ + 2p 18? "' 
and recall s? — 1 = (s — 1) (s?- +5? *+.---4+8+41) =, (s)$,(s). 
The residue ( X (s))6,:s) is found by summing the coefficients of X(s). 
p-2 
The residue ( X (s))g,/s) is given by = (a, — @p_1)s*. Define R, to be 
k=0 


the matrix that reduces X(s) modulo # (s) and ®, (s), such that if 
Xo(s) = (X(s))o,(s) and X1(s) = ( X (s))5,,s) then 
Equation: 


Xo! _ py 
Ral ee 


where X, Xo and Xj are vectors formed from the coefficients of X(s), 
X  (s) and X (s). That is, 


Equation: 
11111 
1 —1 
Ry = 1 —1 
1 —1 


1 


i 
so that Rp = GC where G,, is the &, (s) reduction matrix of size 


P 
(p — 1) x p. Similarly, let X (s) = a) + 418+---4 a ye-18P and define 
R,« to be the matrix that reduces X(s) modulo  (s), #, (5), ..., Pye (8) 
such that 
Equation: 

Xo 

Xi 

|| = RX, 
Xe 


where as above, X, Xo, ..., X¢ are the coefficients of X(s), ( X (s))5,(s) » 
cay, \ X.(8)) bil) 


It turns out that Ry. can be written in terms of R,. Consider the reduction of 
X (s) = 29 +--+ +2889 by B, (s) = s — 1, 3(s) = 82 +8+1, and 
Py (s) = s° + 8° + 1. This is most efficiently performed by reducing X(s) 
in two steps. That is, calculate X'(s) = ( X(s)),3_; and 

X2(s)= ( X(s)) 045341. Then calculate Xo (s) = ( X'(s)),-1 and 
X1(s) = ( X'(s)) 524541. In matrix notation this becomes 

Equation: 


ie Tes Tee Te a 11 1 
bal =e —I,|X and Ba N51. Xe 
js ie =e 4 1. 241 


Together these become 
Equation: 


X 9 Re Iz Ig Ig 
X} ae Tz Ts —Iz X 
X» I Iz —I3 


or 
Equation: 


Xo 
Xi} = (R3 @ Ig) (R3 @ 13) X 
X29 


so that Rg = (Rz © I¢) (Rs ® Iz) where © denotes the matrix direct sum. 
Similarly, one finds that Roz = (R3 ® In4) ((R3 & Iz) ® Tig) (R3 & Ig). In 
general, one has the following. 


Lemma 
e—1 

Rye is a p* X p* matrix given by Rye = I] ((R, ® Ix) ‘oe Tye_pt+1) and 
k=0 


can be implemented with 2(p*° — 1) additions. 


The following formula gives the decomposition of a circular convolution into 
disjoint non-circular convolutions when the number of points is a prime 


power. 
Equation: 
1 
Co, 
Rye Sy Rye = 
CE, 
= Co, 
i=0 ” 
Example: 


Equation: 


It turns out that when n is not a prime power, the reduction of polynomials 
modulo the cyclotomic polynomial &,, (s) becomes complicated, and with an 
increasing number of prime factors, the complication increases. Recall, 
however, that a circular convolution of length pi ee pi can be converted 
(by an appropriate permutation) into a k dimensional circular convolution of 
length p;' along dimension 7. By employing this one-dimensional to multi- 
dimensional mapping technique, one can avoid having to perform polynomial 
reductions modulo &,, (s) for non-prime-power n. 


The prime factor permutation discussed previously is the permutation that 
converts a one-dimensional circular convolution into a multi-dimensional 
one. Again, we can use the Kronecker product to represent this. In this case, 
the reduction operations are applied to each matrix in the following way: 
Equation: 


where 
Equation: 


T= Ryn @ ++: @ Rye 


Example: 
Equation: 


1 
T(S9 @Ss5)T * = On @ | 
Cz. 


where 7’ = Ro ® Rs. 


The matrix Ryo ®---@ Ry can be factored using a property of the 
Kronecker product. Notice that (A ® B) = (A @I)(I ® B), and 
(A® BOC) =(A@I)\(I®B@I)(I &C) (with appropriate 
dimensions) so that one gets 

Equation: 


= 


k 
®R,i =|] (Im, @ Ry ® e 


i=1 1 


where m; = Tat P; w= ee 44 P;’ and where the empty product is 


taken to be 1. This factorization shows that 7’ can be implemented basically 
by implementing copies of R,-«. There are many variations on this 
factorization as explained in [link]. That the various factorization can be 
interpreted as vector or parallel implementations is also explained in [link]. 


Example: 
Equation: 


Rg ® Rs = (Ro ® Is) (Ig ® Rs) 


and 
Equation: 


Rg ® Ros ® R7 = (Ro ® Lizs) 9 ® Ros © I7) (1225 ® R77) 


When this factored form of ®R,,, or any of the variations alluded to above, is 
used, the number of additions incurred is given by 
Equation: 


i=1 P; 1-2 
: 1 
= 2NyY1-— 
i=1 P; 
ae i 
= an (5 *] 
i= Pj 


where N = p)'---p,*. 
Although the use of operations of the form Ryo ®-:-®@ Ry is simple, it 


does not exactly separate the circular convolution into smaller disjoint 
convolutions. In other words, its use does not give rise in [link] and [link] to 
block diagonal matrices whose diagonal blocks are the form @;C¢, . 
However, by reorganizing the arrangement of the operations we can obtain 
the block diagonal form we seek. 


First, suppose A, B and Care matrices of sizes a x a, b x bandc x c 
respectively. If 
Equation: 


By 
TBT \= 
| | 


where B, and Bo are matrices of sizes b; x b; and by x be, then 
Equation: 


Q(d@Becygt=|*2%2° 


fone 


where 
Equation: 


Q = I, ®T (1: 61,:)@ I. 
~~ {Ig @T (by +1:6,:)@ 1] 


Here T'(1 : b;,:) denotes the first bj rows and all the columns of T and 
similarly for T'(b; + 1 : b,:). Note that 
Equation: 


Peon 


B, 
A Gs: 
ioieelt | | ° 


That these two expressions are not equal explains why the arrangement of 
operations must be reorganized in order to obtain the desired block diagonal 
form. The appropriate reorganization is described by the expression in [link]. 
Therefore, we must modify the transformation of [link] appropriately. It 
should be noted that this reorganization of operations does not change their 
computational cost. It is still given by [link]. 


For example, we can use this observation and the expression in [link] to 
arrive at the following similarity transformation: 


Equation: 
1 
—1 Co,, 
Q(Sp, ® Sy.)Q = Co, 
Co, © Cz, 

where 
Equation: 

Ip, © ian 

_ (Rp ® I, ) 
Ty, @ Gp, ° 


1_, is a column vector of p 1's 
Equation: 


and G’, is the (p — 1) x p matrix: 
Equation: 


Gp came . = [Ip ~* 154 : 


In general we have 
Equation: 


where R = Ry, pt is given by 
Equation: 


1 
Fy agit = [] Q mpm) 
i=k 


i-1 _e; k 
with Mm; = a Djs n= Tain P;’ and 
Equation: 


2-1| 1a @ 12, @ Lops 
Q (a, p®,c) =i Ig ® Gp ® Lepi 


1 
=0 
J Tac(pe—pi+1) 


1_, and G, are as given in [link] and [link]. 


Example: 
Equation: 
1 
Co, 
C 
R(S9 ® Ss)R" = = 
Co, 
Ce, ® Ca, 
Cs, & Cs, 
where 
Equation: 
R= Rog5 


Q(9,5,1)Q(1, 9,5) 


and R can be implemented with 152 additions. 


Notice the distinction between this example and example "Reduction 
Operations". In example "Reduction Operations" we obtained from Sg ® S55 
a Kronecker product of two block diagonal matrices, but here we obtained a 
block diagonal matrix whose diagonal blocks are the Kronecker product of 
cyclotomic companion matrices. Each block in [link] represents a multi- 
dimensional cyclotomic convolution. 


e. in [link] is 


A Matlab program that carries out the operation Rye ca 


KRED(). 


function x = KRED(P,E,K, x) 
% X = KRED(P,E,K,x); 

% P : P = [P(1),...,P(K)]; 
% E : E = [E(K),...,E(K)]; 


for i 1:K 

a = prod(P(1:i-1).4E(1:i-1)); 

C = prod(P(iti:K).4E(1+1:K)); 

p= PCy); 

e = E(i); 

for j = e-1:-1:0 

X(1ia*c*(p%(jt+1))) = RED(p,a,c*(pj),x(1:a*c* 

(p%(jt+1)))); 

end 
end 


It calls the Matlab program RED( ) . 


function y = RED(p,a,c, x) 
% y = RED(p,a,c,x); 
y = zeros(a*c*p,1); 
for i = O:c:(a-1)*c 
for j = O:c-1 
y(itjt+1) = x(i*ptjtt); 
for k = O:c:c*(p-2) 
y(itjt+1) = y(itj+l) + x(i*ptj+ktct1); 
y(i*(p-1)+jtkta*c+1) = x(i*ptjtk+1) 
X(i*ptjtc*(p-1)+1); 
end 
end 
end 


These two Matlab programs are not written to run as fast as they could be. 
They are a ‘naive’ coding of Rye sang and are meant to serve as a basis for 
+ ?. k 


more efficient programs. In particular, the indexing and the loop counters can 
be modified to improve the efficiency. However, the modifications that 


minimize the overhead incurred by indexing operations depends on the 
programming language, the compiler and the computer used. These two 
programs are written with simple loop counters and complicated indexing 
operations so that appropriate modifications can be easily made. 


Inverses 


The inverse of A, has the form 


Equation: 
1 p—-1 -1 —1 —1 
1 -1 p-1 -1 —1 
1 1 
Roo =—}1 -!1 —1 p-1 -l 
le 2 ek ed 
1 —-1 —1 —1 —1 
and 
Equation: 
e—1 
=| =| 
Rye = I] ((R, & T,e-1-) @ Teen) 
k=0 


Because the inverse of Q in [link] is given by 
Equation: 


Q° = I, @ T* (:,1 : bi) @ I, Lor (:, by +1: b) @ I, 
the inverse of the matrix R described by eqs [link], [link] and [link] is given 


by 
Equation: 


k 


i = I] Q(m;, pi, ni) 


i=1 


; 1. 227 k a 
with m; = ja D;'s = [Tjeiu Pj and 
Equation: 


Ole @ 15, @ Leni Iq ® Vp @ Leni 
Q(a,r°,c) * = |] _ _ = 


jee—1 Lac(pe—pi*) 


where V, denotes the matrix in [link] without the first column. A Matlab 
program for R¢ is tKRED( ) , it calls the Matlab program tRED() .A 
Matlab program for R~ is itKRED( ) , it calls the Matlab program 
itRED() . These programs all appear in one of the appendices. 


Bilinear Forms for Circular Convolution 

This collection of modules is from a Rice University, ECE Department 
Technical Report written around September 1994. It grew out of the 
doctoral and post doctoral research of Ivan Selesnick working with Prof. C. 
Sidney Burrus at Rice. Earlier reports on this work were published in the 
ICASSP and ISCAS conference proceedings in 1992-94 and a fairly 
complete report was published in the IEEE Transaction on Signal 
Processing in January 1996. 


Bilinear Forms for Circular Convolution 


A basic technique in fast algorithms for convolution is that of interpolation. 
That is, two polynomials are evaluated at some common points and these 
values are multiplied [link], [link], [link]. By interpolating these products, 
the product of the two original polynomials can be determined. In the 
Winograd short convolution algorithms, this technique is used and the 
common points of evaluation are the simple integers, 0, 1, and —1. Indeed, 
the computational savings of the interpolation technique depends on the use 
of special points at which to interpolate. In the Winograd algorithm the 
computational savings come from the simplicity of the small integers. (As 
an algorithm for convolution, the FFT interpolates over the roots of unity.) 
This interpolation method is often called the Toom-Cook method and it is 
given by two matrices that describe a bilinear form. 


We use bilinear forms to give a matrix formulation of the split nesting 
algorithm. The split nesting algorithm combines smaller convolution 
algorithms to obtain algorithms for longer lengths. We use the Kronecker 
product to explicitly describe the way in which smaller convolution 
algorithms are appropriately combined. 


The Scalar Toom-Cook Method 


First we consider the linear convolution of two n point sequences. Recall 
that the linear convolution of h and x can be represented by a matrix vector 
product. When n = 3: 


Equation: 


ho 

hy ho x0 

ho hy ho £4 
ho hy “£2 


This linear convolution matrix can be written as hp) Hp + hi, + hoe 
where H;, are clear. 


n—-1 
The product Ly h;H;,« can be found using the Toom-Cook algorithm, an 
k=0 
interpolation method. Choose 2n — 1 interpolation points, 21,--- , 22n—1, 
and let A and C be matrices given by 
Equation: 
“0 n-1 0 -In-2] —1 
4 uy aT uy 
A= and C= 
0 -n—1 -( -2n—2 
bona °° Bnd De eames ea 


That is, A is a degree nm — 1 Vandermonde matrix and Cis the inverse of 
the degree 2n — 2 Vandermonde matrix specified by the same points 
specifying A. With these matrices, one has 

Equation: 


n—-1 
So hpHye = C{Ah*Az} 
k=0 


where * denotes point by point multiplication. The terms Ah and Az are 
the values of H(s) and X(s) at the points 71, - - - i2,_1. The point by point 


multiplication gives the values Y (71),---, Y (¢2n-1). The operation of C’ 
obtains the coefficients of Y(s) from its values at these points of 
evaluation. This is the bilinear form and it implies that 


Equation: 
Ay, = Cdiag (Ae,) A. 
Example: 
If n = 2, then equations [link] and [link] give 
Equation: 
ho 0 
hy ho 4 C{Ah* Ax} 
0 hy 
When the interpolation points are 0, 1,and —1, 
Equation: 
1 0O 0 
A= i. 1 and C=]0 .5 —.5 
—] —] .5 i) 


However, A and C’ do not need to be Vandermonde matrices as in [link]. 
For example, see the two point linear convolution algorithm in the 
appendix. As long as A and C are matrices such that Hy, = Cdiag (Ae,x) A, 
then the linear convolution of x and h is given by the bilinear form 

y = C{Ah* Ax}. More generally, as long as A, B and C are matrices 
satisfying H;, = Cdiag (Be;)A, then y = C{Bh* Ax} computes the 
linear convolution of h and x. For convenience, if C{Ah* Ax} computes 
the n point linear convolution of h and x (both h and z are n point 


sequences), then we say “(A, B, C’) describes a bilinear form for n point 
linear convolution." 


Similarly, we can write a bilinear form for cyclotomic convolution. Let d be 
any positive integer and let X(s) and H(s) be polynomials of degree 

~y(d) — 1 where ¢(-) is the Euler totient function. If A, B and C are 
matrices satisfying (Cs,)" = Cdiag (Be,)A for 0 < k < y(d) — 1, then 
the coefficients of Y (s) = (X (s)H (s))¢,.) are given by 

y = C{Bh* Az}. As above, if y = C{Bh* Ax} computes the d- 
cyclotomic convolution, then we say “(A, B, C) describes a bilinear form 
for £4 (s) convolution." 


But since (X (s)H (s))¢,:) can be found by computing the product of 


X(s) and H(s) and reducing the result, a cyclotomic convolution algorithm 
can always be derived by following a linear convolution algorithm by the 
appropriate reduction operation: If G is the appropriate reduction matrix 
and if (A, B, F’) describes a bilinear form for a y(d) point linear 
convolution, then (A, B, GF’) describes a bilinear form for &q (s) 
convolution. That is, y = GF'{Bh* Ax} computes the coefficients of 


(X (8) H (s))5,(s): 


Circular Convolution 


By using the Chinese Remainder Theorem for polynomials, circular 
convolution can be decomposed into disjoint cyclotomic convolutions. Let 
p be a prime and consider p point circular convolution. Above we found 
that 

Equation: 


and therefore 


Equation: 


Sk — R71 


Pp Pp 


Ry. 


Ck 
G, 


If (Ap, Bp, Cp) describes a bilinear form for ®, (s) convolution, then 
Equation: 


S* = R7} : di : R : R 
- Cp = By a A, ° 


and consequently the circular convolution of h and x can be computed by 
Equation: 


y = R,'C {BR,h* AR,z} 


where A=1@A,,B=10B, andC =1©C,. Wesay (A, B,C) 
describes a bilinear form for p point circular convolution. Note that if 

(D, E, F) describes a (p — 1) point linear convolution then A,, B, and C, 
can be taken to be A, = D, B, = E and C, = G,F where G, represents 
the appropriate reduction operations. Specifically, Gp is given by Equation 
42 from Preliminaries. 


Next we consider p* point circular convolution. Recall that 

S= ay (@%9Ce, | R,« as in Equation 27 from Preliminaries so that 
the circular convolution is decomposed into a set of e + 1 disjoint #,; (s) 
convolutions. If (Ap:, Bp, Cp:) describes a bilinear form for &,; (s) 


convolution and if 
Equation: 


A = 104,@-:-@ A, 
1®@B,G--: DB Bye 
C= 1OC.6)6Cy 


w 
| 


then (AR,., BR,, ie, C ) describes a bilinear form for p© point circular 
convolution. In particular, if (Dg, Ea, Fa) describes a bilinear form for d 
point linear convolution, then A,:, B,: and C,: can be taken to be 
Equation: 


Chi = Gy oe (p*) 


where G,,; represents the appropriate reduction operation and ~(-) is the 
Euler totient function. Specifically, Gp: has the following form 
Equation: 


Geet 
Gp = ea —1_p-1 ® Lpi-1 an | 
Opi-141,(p—2)p-1-1 


if p > 3, while 
Equation: 


—I[p5i-1_ 
Goi = ae |] 
012-11 


Note that the matrix R,- block diagonalizes S,- and each diagonal block 
represents a cyclotomic convolution. Correspondingly, the matrices A, B 
and C' of the bilinear form also have a block diagonal structure. 


The Split Nesting Algorithm 


We now describe the split-nesting algorithm for general length circular 
convolution [link]. Letn = pr see pe where p, are distinct primes. We 
have seen that 

Equation: 


S,=P'R”™ ( 2 @)) RP 
din 


where FP is the prime factor permutation P = fa sep and F represents 


the reduction operations. For example, see Equation 46 in Preliminaries. 
RP block diagonalizes S,, and each diagonal block represents a multi- 
dimensional cyclotomic convolution. To obtain a bilinear form for a multi- 
dimensional convolution, we can combine bilinear forms for one- 


dimensional convolutions. If (Ay: B, i, Cy) describes a bilinear form for 
Jj Z Z 


$i (s) convolution and if 


Equation: 
A = @a\nAa 
B= ®anBa 
C = Pa\nCa 
with 
Equation: 
Ag = ®pldpePAHA(p) 
Ba = @pldpePBHi(p) 
Ca = Bpldpe PC Hip) 


where Hq (p) is the highest power of p dividing d, and # is the set of 
primes, then (ARP, BLP, PRC) describes a bilinear form for n point 


circular convolution. That is 
Equation: 


y — P*R 'C{BRPh* ARPx} 


computes the circular convolution of h and z. 
As above (Ay: Bi, Cy) can be taken to be 
Jj Bf Jj 


D_),E (,\,GyiF (/,\ | where (Dg, Ea, Fa) describes a bilinear 

(Pata) Bela)» GniFy(y)) we (Da Ba Fd 

form for d point linear convolution. This is one particular choice for 

(Ay: Bi, Cys) - other bilinear forms for cyclotomic convolution that are 
J J J 


not derived from linear convolution algorithms exist. 


Example: 
A 45 point circular convolution algorithm: 
Equation: 
y = P'R 'C{BRPh* ARPx} 
where 
Equation: 
Pe P95 
a lig. 
A = 1@A3@ Ag @ As @ (Az ® As) © (Ag @ As) 
B = 1@0B30 Bo @ Bs @ (B3 ® Bs) ® (Bo ® Bs) 
C = 106C36Co OCs @ (C3 ® Cs) @ (Cy ® C5) 


and where (Ay: B, i, Cy) describes a bilinear form for Pi (s) 


convolution. 


The Matrix Exchange Property 


The matrix exchange property is a useful technique that, under certain 
circumstances, allows one to save computation in carrying out the action of 
bilinear forms [link]. Suppose 

Equation: 


y = C{Ax* Bh} 


as in [link]. When h/ is known and fixed, Bh can be pre-computed so that y 
can be found using only the operations represented by C' and A and the 
point by point multiplications denoted by *. The operation of B is absorbed 
into the multiplicative constants. Note that in [link], the matrix 
corresponding to C' is more complicated than is B. It is therefore 
advantageous to absorb the work of C instead of B into the multiplicative 
constants if possible. This can be done when y is the circular convolution of 
x and h by using the matrix exchange property. 


To explain the matrix exchange property we draw from [link]. Note that 
y = Cdiag(Az) Bh, so that Cdiag(A)B must be the corresponding 
circulant matrix, 


Equation: 
LQ n-1 Ly 
LI “0 4 bp) 
Cdiag (Ax) B = 
Ln-1 Ln-2 LO 


Since Cdiag (Ax) B = J(Cdiag(Ax)B)‘J where J is the reversal matrix, 
one gets 
Equation: 


y = C{Ax*Bh} 
= Cdiag(Az)Bh 
= J(Cdiag(Ax)B)'Jh 
= JB‘diag(Ax)C*Jh 
= JB'{Ax*C'Jh} 


As noted in [link], the matrix exchange property can be used whenever 
y = T(x)h where T(z) satisfies T (x) = J,T(a)’ Jo for some matrices J; 
and J». In that case one gets y = JB {Az*C* Joh}. 


Applying the matrix exchange property to [link] one gets 
Equation: 


y= JIPR'B{C'’R 'PJh* ARP}. 


Example: 
A 45 point circular convolution algorithm: 
Equation: 


y = JP'R'B{u* ARPx} 


where u = C'R-*PJh and 


Equation: 
Jes IE 
er ols 
A = 1@A3@ Ao @ As @ (A3 @ As) @ (Ag ®@ As) 
BY = 1@B3;@ By ® Bs @ (Bs ® Bs) @ (By ® Bs) 


Ct’ = 1e@C}EC§ GC! o (Ci @ Cf) & (CE @ Ch) 


and where (Ay: By, Cy) describes a bilinear form for &, (s) 


convolution. 


A Bilinear Form for the DFT 

This collection of modules is from a Rice University, ECE Department 
Technical Report written around September 1994. It grew out of the 
doctoral and post doctoral research of Ivan Selesnick working with Prof. C. 
Sidney Burrus at Rice. Earlier reports on this work were published in the 
ICASSP and ISCAS conference proceedings in 1992-94 and a fairly 
complete report was published in the IEEE Transaction on Signal 
Processing in January 1996. 


A Bilinear Form for the DFT 


A bilinear form for a prime length DFT can be obtained by making minor 
changes to a bilinear form for circular convolution. This relies on Rader's 
observation that a prime p point DFT can be computed by computing a 

p — 1 point circular convolution and by performing some extra additions 
[link]. It turns out that when the Winograd or the split nesting convolution 
algorithm is used, only two extra additions are required. After briefly 
reviewing Rader's conversion of a prime length DFT in to a circular 
convolution, we will discuss a bilinear form for the DFT. 


Rader's Permutation 


To explain Rader's conversion of a prime p point DFT into a p — 1 point 
circular convolution [link] we recall the definition of the DFT 
Equation: 


p-1 


y(k) = Soa (n)w™ 


n=0 


with W =exp — j 27/p. Also recall that a primitive root of p is an integer 
r such that ( r”), maps the integers m = 0,---,p — 2 to the integers 
1,---,p —1. Letting n = r~-™ and k = r', where r~™ is the inverse of r™ 
modulo p, the DFT becomes 

Equation: 


for 1 = 0,---,p — 2. The ‘DC’ term fis given by y(0) = S<?_-) a(n). By 
defining new functions 
Equation: 


which are simply permuted versions of the original sequences, the DFT 
becomes 


Equation: 
p-2 
y (I) =2(0)+ >> 2! (m)W' (l—m) 
m=0 
for 1 = 0,---,p — 2. This equation describes circular convolution and 


therefore any circular convolution algorithm can be used to compute a 
prime length DFT. It is only necessary to (i) permute the input, the roots of 
unity and the output, (ii) add x(0) to each term in [link] and (iii) compute 
the DC term. 


To describe a bilinear form for the DFT we first define a permutation matrix 
Q for the permutation above. If p is a prime and r is a primitive root of p, 
then let Q, be the permutation matrix defined by 

Equation: 


Qe (rk),-1 = €k 
for 0 < k < p— 2 where e; is the k** standard basis vector. Let the @ be a 


p — 1 point vector of the roots of unity: 
Equation: 


w= (W),---,wety’, 


If s is the inverse of r modulo p (that is, rs = 1 modulo p) and 


% = (a (1),---,a(p—1))’, then the circular convolution of [link] can be 
computed with the bilinear form of [link]: 
Equation: 


Qt JP*R'B{C'R+*PJQ,@* ARPQ,}. 


This bilinear form does not compute y(0), the DC term. Furthermore, it is 
still necessary to add the x(0) term to each of the elements of [link] to 


obtain y(1),---,y(p — 1). 


Calculation of the DC term 


The computation of y(0) turns out to be very simple when the bilinear form 
[link] is used to compute the circular convolution in [link]. The first 
element of ARPQ, in [link] is the residue modulo the polynomial s — 1, 
that is, the first element of this vector is the sum of the elements of 7. (The 
first row of the matrix, R, representing the reduction operation is a row of 
1's, and the matrices P and Q, are permutation matrices.) Therefore, the 
DC term can be computed by adding the first element of ARPQ,Z to x(0). 
Hence, when the Winograd or split nesting algorithm is used to perform the 
circular convolution of [link], the computation of the DC term requires only 
one extra complex addition for complex data. 


The addition x(0) to each of the elements of [link] also requires only one 
complex addition. By adding x(0) to the first element of 
{C'R*PJQ,w* ARPQ,z} in [link] and applying Q.IP'R* to the 
result, z(0) is added to each element. (Again, this is because the first 
column of R® is a column of 1's, and the matrices Q‘, J and P? are 
permutation matrices.) 


Although the DFT can be computed by making these two extra additions, 
this organization of additions does not yield a bilinear form. However, by 
making a minor modification, a bilinear form can be retrieved. The method 
described above can be illustrated in [link] with u = C'’R‘*PJQ,@. 


x0) 4 ~< =e vO 


tii | afl) yi) 


x(p-l) t f yip-D) 


The flow graph for the computation of the 
DFT. 


Clearly, the structure highlighted in the dashed box can be replaced by the 
structure in [link]. 


u(1)-1 


The flow graph for the 
bilinear form. 


By substituting the second structure for the first, a bilinear form is obtained. 
The resulting bilinear form for a prime length DFT is 
Equation: 


1 
= t 
y= piven | 


where w = (W®,---, wet)! a = (x (0),---,x(p—1))’, and where 
U,, is the matrix with the form 
Equation: 


1 * 
ee. wee ee. o} 


and V,, is the matrix with the form 
Equation: 


Implementing Kronecker Products Efficiently 

This collection of modules is from a Rice University, ECE Department 
Technical Report written around September 1994. It grew out of the 
doctoral and post doctoral research of Ivan Selesnick working with Prof. C. 
Sidney Burrus at Rice. Earlier reports on this work were published in the 
ICASSP and ISCAS conference proceedings in 1992-94 and a fairly 
complete report was published in the IEEE Transaction on Signal 
Processing in January 1996. 


Implementing Kronecker Products Efficiently 


In the algorithm described above we encountered expressions of the form 
A, ® Az ®---® Ay which we denote by ®@;_, A;. To calculate the 
product (®,;A;)z it is computationally advantageous to factor @; A, into 
terms of the form J @ A; ® I[link]. Then each term represents a set of 
copies of A;. First, recall the following property of Kronecker products 
Equation: 


AB®CD =(A®C\(B@D). 


This property can be used to factor ®,; A; in the following way. Let the 
number of rows and columns of A; be denoted by 7; and c; respectively. 
Then 

Equation: 


Ai; @®Ag = Axl, & I,, Ag 
a (Ay ® I,-y) (Le, ® Ay). 


But we can also write 
Equation: 
A,@®Ay = T,, Ay ®&) Aol, 
(I, ® A») (Ay ® Ley) 


Note that in factorization [link], copies of Az are applied to the data vector 
x first, followed by copies of A;. On the other hand, in factorization [link], 
copies of A, are applied to the data vector = first, followed by copies of A» 
. These two factorizations can be distinguished by the sequence in which 
A, and Ag are ordered. 


Lets compare the computational complexity of factorizations [link] and 
[link]. Notice that [link] consists of rz copies of A, and c; copies of Ag, 
therefore [link] has a computational cost of r2 2, + cj 2» where Q; is the 
computational cost of A;. On the other hand, the computational cost of 
[link] is cop Q, + 71 Qo. That is, the factorizations [link] and [link] have in 
general different computational costs when A; are not square. Further, 
observe that [link] is the more efficient factorization exactly when 
Equation: 


7214022 < 0214+ 7122 


or equivalently, when 
Equation: 


T1 — Cy T2 — C2 
—— > ———. 


Di Qs, 


Consequently, in the more efficient factorization, the operation A; applied 
to the data vector « first is the one for which the ratio (r; — c;)/Q; is the 
more negative. If ry > cy and rg < Cp then [link] is always true (Q; is 
always positive). Therefore, in the most computationally efficient 
factorization of A; ® A», matrices with fewer rows than columns are 
always applied to the data vector x before matrices with more rows than 
columns. If both matrices are square, then their ordering does not affect the 
computational efficiency, because in that case each ordering has the same 
computation cost. 


We now consider the Kronecker product of more than two matrices. For the 
Kronecker product @7_, A; there are n! possible different ways in which to 


order the operations A;. For example 


Equation: 


A; @ A, @A3 = (Ai @Ipr,) Uc, ® Az ® Ir,) (Ie,c, ® A3 
Ay @ pyr) (Ler, ® Az) (Lc, ® Az ® Ie, 
T,, ®@ Az ® I,,) (A1 ® Tesrs) (eye, ® Az 
I,, ® Ag ® I-,) (Irie, ® Az) (A1 ® Lese, 
I,.r, ® Az) (A1 ® Ip c,) (Le, ® Az ® Ie, 


) 
) 
) 
) 
) 
I,,.r, ® A3) Ur, ® Az ® Ic,) (A1 ® Lec) 


( 
( 
( 
( 
( 
( 


Each factorization of ®;A; can be described by a permutation g(-) of 
{1,...,7} which gives the order in which A; is applied to the data vector 
x. A4c1) is the first operation applied to the data vector x, A,vz) is the 
second, and so on. For example, the factorization [link] is described by the 
permutation g(1) = 3, g(2) = 1, 9(3) = 2. For n = 3, the computational 
cost of each factorization can be written as 

Equation: 


@ (9) = Qy1yeg(2)Cq(3) + 79(1) 2o(2)C9(3) + 79(1)7 9(2) 2 (3) 


In general 
Equation: 


€() =>. (iI “| Qyii) ( I] “40 ; 


i=1 j=itl 


Therefore, the most efficient factorization of ®;A; is described by the 
permutation g(-) that minimizes @. 


It turns out that for the Kronecker product of more than two matrices, the 
ordering of operations that describes the most efficient factorization of 
®,A; also depends only on the ratios (r; — c;)/2;. To show that this is the 
case, suppose u(-) is the permutation that minimizes @, then u(-) has the 
property that 


Equation: 


Tu(k) — Cu(k) u(k+1) (k+1) 
Que 2a (k+1) 
fork = 1,---,n —1.To support this, note that since u(-) is the 
permutation that minimizes @, we have in particular 
Equation: 
@(u) < @(v) 


where v(-) is the permutation defined by the following: 
Equation: 


u(i) i<ki>k+1 
v(i)=<(u(k+1) i=k 
u(k) i=khl 


Because only two terms in [link] are different, we have from [link] 
Equation: 


k+1 n k+1 nm 
¥ (In ) 2u0( H ««9] < + (Lr ) 20 I) 
A jit j=it 


which, after canceling common terms from each side, gives 
Equation: 


Qru(k)Cu(k+1) + Pu(k)Pu(k+1) < Boe)yCo(e+1) + Poe) Poe-+1)- 


Since v(k) = u(k + 1) and v(k + 1) = u(k) this becomes 
Equation: 


Quik) Cu(k+1) + Tu(k) Pucker) < Fuce+1)Cu(ky + Pu(ee1) u(r) 


which is equivalent to [link]. Therefore, to find the best factorization of 
®,A, it is necessary only to compute the ratios (r; — c;)/2; and to order 
them in an non-decreasing order. The operation A; whose index appears 
first in this list is applied to the data vector x first, and so on 


As above, if ru(e+1) > Cu(e+1) and Tu(K) < Cup) then [link] is always true. 
Therefore, in the most computationally efficient factorization of ®; Aj, all 
matrices with fewer rows than columns are always applied to the data 
vector x before any matrices with more rows than columns. If some 
matrices are square, then their ordering does not affect the computational 
efficiency as long as they are applied after all matrices with fewer rows 
than columns and before all matrices with more rows than columns. 


Once the permutation g(-) that minimizes @ is determined by ordering the 
ratios (r; — c;)/2;, ®; A; can be written as 


Equation: 
7 1 
BA =] [tae ® Aga ® hw 
where 
Equation: 
g(i)—1 
a(i)= [] y@*) 
k=1 
Equation: 


and where +(-) is defined by 
Equation: 


Some Matlab Code 


A Matlab program that computes the permutation that describes the 
computationally most efficient factorization of @7_, A; is cgc() . It also 
gives the resulting computational cost. It requires the computational cost of 
each of the matrices A; and the number of rows and columns of each. 


function [g,C] = cgc(Q,r,c,n) 

% [g,C] = cgc(Q,r,c,n); 

% Compute g and C 

% g : permutation that minimizes C 
% C : computational cost of Kronecker product of 
A(1),...,A(n) 

% Q : computation cost of A(1) 

% r : rows Of A(1i) 

% Cc : columns of A(1) 

% n : number of terms 

f = find(Q==0); 

Q(t) = eps * ones(size(Q(f))); 


Q ~ Q(:); 

r=r(:); 

c= c(:); 

L[s,g] = sort((r-c)./Q); 
C = 0; 


for 1 = 1i:n 

C = C + prod(r(g(1:1- 
1)))*Q(g(1))*prod(c(g(iti:n))); 
end 
C = round(C); 


The Matlab program kpi() implements the Kronecker product @}!_, Ai. 


function y = kpi(d,g,r,c,n,x) 

% y = kpi(d,g,r,c,n,x); 

% Kronecker Product : A(d(1)) kron ... kron 
A(d(n)) 

% g : permutation of 1,...,n 

%r i [r(1),...,F(n)] 

% Cc : [c(1),..,c(n)] 

% r(1) : rows of A(d(1i)) 

% C(1) : columns of A(d(1i)) 

% n : number of terms 


1:(g(1)-1) 
find(g==k) 
a * r(k); 


9 F Il 
ll V 


te) 
I 


=a * c(k); 


= (g(i)+1):n 
i > find(g==k) 
b = b * r(k); 
e 


b = b * c(k); 


end 
% y = (I(a) kron A(d(g(i1))) kron I(b)) * x; 
y = IAI(d(g(1)),a,b,x); 

end 


where the last line of code calls a function that implements 
(Z; ® Aggy) ® I,) x. That is, the program IAI(1, a,b, X) implements 
(I, @ A(i) @I)z. 


The Matlab program IAI implements y = (I, ® A @I,)x 


function y = IAI(A,r,c,m,n, x) 


% y = (I(m) kron A kron I(n))x 
% 1 number of rows of A 
% C number of columns of A 
v = O:n:in*(r-1); 
u = O:n:n*(c-1); 
for 1 = O:m-1 
for j = O:n-1 
y(vti*r*n+jt+1) = A * x(uti*c*nt+j+1); 
end 
end 


It simply uses two loops to implement the mn copies of A. Each copy of A 
is applied to a different subset of the elements of z. 


Vector/Parallel Interpretation 


The command J ® A @ I where ® is the Kronecker (or Tensor) product 
can be interpreted as a vector/parallel command [link], [link]. In these 
references, the implementation of these commands is discussed in detail and 
they have found that the Tensor product is “an extremely useful tool for 
matching algorithms to computer architectures [link].” 


The expression J ® A can easily be seen to represent a parallel command: 
Equation: 


I@A= 


Each block along the diagonal acts on non-overlapping sections of the data 
vector - so that each section can be performed in parallel. Since each 
section represents exactly the same operation, this form is amenable to 
implementation on a computer with a parallel architectural configuration. 
The expression A ® J can be similarly seen to represent a vector command, 
see [link]. 


It should also be noted that by employing ‘stride’ permutations, the 
command (I @ A @ I)x can be replaced by either (I @ A)x or (A @ I)x 
[link], [link]. It is only necessary to permute the input and output. It is also 
the case that these stride permutations are natural loading and storing 
commands for some architectures. 


In the programs we have written in conjunction with this paper we 
implement the commands y = (I @ A ® I)z with loops in a set of 
subroutines. The circular convolution and prime length FFT programs we 
present, however, explicitly use the form J @ A ® J to make clear the 
structure of the algorithm, to make them more modular and simpler, and to 
make them amenable to implementation on special architectures. In fact, in 
[link] it is suggested that it might be practical to develop tensor product 
compilers. The FFT programs we have generated will be well suited for 
such compilers. 


Programs for Circular Convolution 


Programs for Circular Convolution 


To write a program that computes the circular convolution of A and x using the bilinear form Equation 24 in 
Bilinear Forms for Circular Convolution we need subprograms that carry out the action of P, P*, R, R‘’, A and B* 
. We are assuming, as is usually done, that h is fixed and known so that u = C'R~'PJh can be pre-computed and 
stored. To compute these multiplicative constants u we need additional subprograms to carry out the action of C* 
and R~ but the efficiency with which we compute u is unimportant since this is done beforehand and u is stored. 


In Prime Factor Permutations we discussed the permutation P and a program for it pfp() appears in the 
appendix. The reduction operations R, R' and R~ we have described in Reduction Operations and programs for 
these reduction operations KRED( ) etc, also appear in the appendix. To carry out the operation of A and B‘ we 
need to be able to carry out the action of Ag, ® --- ® Ag, and this was discussed in Implementing Kronecker 
Products Efficiently. Note that since A and B? are block diagonal, each diagonal block can be done separately. 
However, since they are rectangular, it is necessary to be careful so that the correct indexing is used. 


To facilitate the discussion of the programs we generate, it is useful to consider an example. Take as an example 
the 45 point circular convolution algorithm listed in the appendix. From Equation 19 from Bilinear Forms for 
Circular Convolution we find that we need to compute x = Pg 5x and = Rg5x. These are the first two 
commands in the program. 


We noted above that bilinear forms for linear convolution, (Da, Ez, Fa), can be used for these cyclotomic 
convolutions. Specifically we can take Api = Dy(pi), Bpi = Egypi) and Cp: = Gpi Fp’). In this case Equation 20 
in Bilinear Forms for Circular Convolution becomes 

Equation: 


A=1@D2 0 D5 ® Ds © (D2 ® Ds) © (Do ® Du). 


In our approach this is what we have done. When we use the bilinear forms for convolution given in the appendix, 
for which D4 = Dz ® Dz and Dg = D2 ® D3, we get 
Equation: 


A=16 D>» @ (D2 ® D3) ® (D2 ® Dz) (Dz ® D2 ® D2) © (Dz ® D3 ® D2 ® Dz) 


and since Eg = Dz for the linear convolution algorithms listed in the appendix, we get 
Equation: 


B=16 D}® (D} ® D3) @ (D3 ® Dj) @ (D3 ® Dj ® D}) © (D} ® D3 ® D5, @ D5). 


From the discussion above, we found that the Kronecker products like Dz ® D2 ® Dz appearing in these 
expressions are best carried out by factoring the product in to factors of the form I, ® D2 ® Ip. Therefore we need 
a program to carry out (Ig ® Dz @ Iy)x and (Ig © D3 ® Iy)x. These function are called ID2I (a,b, x) and 
ID3I(a,b, x) and are listed in the appendix. The transposed form, (Ts @ Di @ I,)a, is called 

TD 2Zti((ayon x) . 


To compute the multiplicative constants we need C’. Using Ci = GpiF pip) we get 
Equation: 
C' = 10 FiG4 & FjG5 © FiGs © (FiG3 ® F{G$) @ (FEG5 ® F{G5) 
= 10 F,G) © Fy) © FiGs © (Fy ® Ff) (G5 ® Gs) © (F5 @ Fy) (Gy @ G5). 


The Matlab function KF t carries out the operation Fy, ® --- Fa,. The Matlab function Kcrot implements the 
operation Gya @-:: Gok. They are both listed in the appendix. 


Common Functions 


By recognizing that the convolution algorithms for different lengths share a lot of the same computations, it is 
possible to write a set of programs that take advantage of this. The programs we have generated call functions from 
a relatives small set. Each program calls these functions with different arguments, in differing orders, and a 
different number of times. By organizing the program structure in a modular way, we are able to generate 
relatively compact code for a wide variety of lengths. 


In the appendix we have listed code for the following functions, from which we create circular convolution 
algorithms. In the next section we generate FFT programs using this same set of functions. 


¢ Prime Factor PermutationsThe Matlab function pfp implements this permutation of Prime Factor 
Permutations. Its transpose is implemented by pfpt . 

e Reduction OperationsThe Matlab function KRED implements the reduction operations of Reduction 
Operations. Its transpose is implemented by tKRED . Its inverse transpose is implemented by itKRED and 
this function is used only for computing the multiplicative constants. 

¢ Linear Convolution OperationsID21 and ID3I are Matlab functions for the operations I ® Dz ® J and 
I ® D3 ® I. These linear convolution operations are also described in the appendix “Bilinear Forms for 
Linear Convolution.’ ID2tI and ID3tI implement the transposes, I @ D5 @ I andI @ dD, @l. 


Operation Counts 


[link] lists operation counts for some of the circular convolution algorithms we have generated. The operation 
counts do not include any arithmetic operations involved in the index variable or loops. They include only the 
arithmetic operations that involve the data sequence x in the convolution of x and h. 


The table in [link] for the split nesting algorithm gives very similar arithmetic operation counts. For all lengths not 
divisible by 9, the algorithms we have developed use the same number of multiplications and the same number or 
fewer additions. For lengths which are divisible by 9, the algorithms described in [link] require fewer additions 
than do ours. This is because the algorithms whose operation counts are tabulated in the table in [link] use a special 
®, (s) convolution algorithm. It should be noted, however, that the efficient &y (s) convolution algorithm of [link] 
is not constructed from smaller algorithms using the Kronecker product, as is ours. As we have discussed above, 
the use of the Kronecker product facilitates adaptation to special computer architectures and yields a very compact 
program with function calls to a small set of functions. 


N muls adds N muls adds N muls adds N r 
2 2 4 24 56 244 80 410 1546 240 1 
3 4 11 27 94 485 84 320 1712 252 L 
4 5 15 28 80 416 90 380 1858 270 Ul 


5 10 31 30 80 386 105 640 2881 280 2: 


6 8 34 35 160 707 108 470 2546 315 
7 16 71 36 95 493 112 656 2756 336 
8 14 46 40 140 568 120 560 2444 360 
9 19 82 42 128 718 126 608 3378 378 
10 20 82 45 190 839 135 940 4267 420 
12 20 92 48 164 656 140 800 3728 432 
14 32 170 54 188 1078 144 779 3277 504 
15 40 163 56 224 1052 168 896 4276 540 
16 41 135 60 200 952 180 950 4466 560 
18 38 200 63 304 1563 189 1504 7841 630 
20 50 214 70 320 1554 210 1280 6182 720 
21 64 317 72 266 1250 216 1316 6328 756 
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Operation counts for split nesting circular convolution algorithms 


It is possible to make further improvements to the operation counts given in [link][link], [link]. Specifically, 
algorithms for prime power cyclotomic convolution based on the polynomial transform, although more 
complicated, will give improvements for the longer lengths listed [link], [link]. These improvements can be easily 
included in the code generating program we have developed. 


Programs for Prime Length FFTs 


Programs for Prime Length FFTs 


Using the circular convolution algorithms described above, we can easily design algorithms for prime 
length FFTs. The only modifications that needs to be made involve the permutation of Rader [link] 
and the correct calculation of the DC term (y(0)). These modifications are easily made to the above 
described approach. It simply requires a few extra commands in the programs. Note that the 
multiplicative constants are computed directly, since we have programs for all the relevant 
operations. 


In the version we have currently implemented and verified for correctness, we precompute the 
multiplicative constants, the input permutation and the output permutation. From Equation 8 from A 
Bilinear Form for the DFT, the multiplicative constants are given by V, (1 @eCc'R'PJ Qs)w, the 
input permutation is given by 1 @ PQ,, and the output permutation is given by 1 6 Q'JP*. The 
multiplicative constants, the input and output permutation are each stored as vectors. These vectors 
are then passed to the prime length FFT program, which consists of the appropriate function calls, see 
the appendix. In previous prime length FFT modules, the input and output permutations can be 
completely absorbed in to the computational instructions. This is possible because they are written as 
straight line code. It is simple to modify the code generating program we have implemented so that it 
produces straight line code and absorbs the permutations in to the computational program 
instructions. 


In an in-place in-order prime factor algorithm for the DFT [link], [link], the necessary permuted 
forms of the DFT can be obtained by modifying the multiplicative constants. This can be easily done 
by permuting the roots of unity, w, in the expression for the multiplicative constants [link], [link], 
nothing else in the structure of the algorithm needs to be changed. By changing the multiplicative 
constants, it is not possible, however, to omit the permutation required for Rader's conversion of the 
prime length DFT in to circular convolution. 


Operation Counts 


[link] lists the arithmetic operations incurred by the FFT programs we have generated and verified for 
correctness. Note that the number of additions and multiplications incurred by the programs we have 
generated are the same as previously existing programs for prime lengths up to and including 13. For 
p = 17 a program with 70 multiplications and 314 additions has been written, and for p = 19 a 
program with 76 multiplications and 372 additions has been written [link]. Thus for the length 

p = 17, the program we have generated requires fewer total arithmetic operations, while for p = 19, 
ours uses more. 


There are several table of operation counts in [link], each table corresponding to a different variation 
of the algorithms used in that paper. For each variation, the algorithms we have described use fewer 
additions and fewer multiplications. The focus of [link], however, is the implementation of prime 
point FFT on various computer architectures and the advantage that can be gained from matching 
algorithms with architectures. It should be noted that the highest prime in [link] for which an FFT 
was designed is 29. Although we have not executed the programs described in this paper on these 
computers, they are, as mentioned above, written to be easily adapted to parallel/vector computers. 
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Conclusion 


Conclusion 


We have found that by using the split nesting algorithm for circular 
convolution a new set of efficient prime length DFT modules that cover a 
wide variety of lengths can be developed. We have also exploited the 
structure in the split nesting algorithm to write a program that automatically 
generates compact readable code for convolution and prime length FFT 
programs. 


The resulting code makes clear the organization and structure of the 
algorithm and clearly enumerates the disjoint convolutions into which the 
problem is decomposed. These independent convolutions can be executed 
in parallel and, moreover, the individual commands are of the form 

I @ A ® I which can be executed as parallel/vector commands on 
appropriate computer architectures[link]. By recognizing also that the 
algorithms for different lengths share many of the same computational 
structures, the code we generate is made up of calls to a relatively small set 
of functions. Accordingly, the subroutines can be designed to specifically 
suit a given architecture. 


The number of additions and multiplications incurred by the programs we 
have generated are the same as or are competitive with existing prime 
length FFT programs. We note that previously, prime length FFTs were 
made available for primes only up to 29. As in the original Winograd short 
convolution algorithms, the efficiency of the resulting prime p point DFT 
algorithm depends largely upon the factorability of p — 1. For example, if 
p — 1 is two times a prime, then an efficient p point DFT algorithm is more 
difficult to develop. 


It should be noted too that the programs for convolution developed above 
are useful in the convolution of long integer sequences when exact results 
are needed. This is because all multiplicative constants in an 7 point integer 
convolution are integer multiples of 1/n and this division by n can be 
delayed until the last stage or can simply be omitted if a scaled version of 
the convolution is acceptable. 


By developing a large library of prime point FFT programs we can extend 
the maximum length and the variety of lengths of a prime factor algorithm 
or a Winograd Fourier transform algorithm. Furthermore, because the 
approach taken in this paper gives a bilinear form, it can be incorporated 
into the dynamic programming technique for designing optimal composite 
length FFT algorithms [link]. The programs described in this paper can also 
be adapted to obtain discrete cosine transform (DCT) algorithms by simply 
permuting the input and output sequences [link]. 


Appendix: Bilinear Forms for Linear Convolution 


Appendix: Bilinear Forms for Linear Convolution 


The following is a collection of bilinear forms for linear convolution. In each case (Dy, Dn, F,) describes a 
bilinear form for n point linear convolution. That is 
Equation: 


y = F,{D,h* D2} 


computes the linear convolution of the n point sequences h and x. 


For each D,, we give Matlab programs that compute D,,x and D‘,z, and for each F,, we give a Matlab program 
that computes F'!x. When the matrix exchange algorithm is employed in the design of circular convolution 
algorithms, these are the relevant operations. 


2 point linear convolution 


D2 can be implemented with 1 addition, Dp with two additions. 


Equation: 
1 0 
D,= 01 
1 1 
Equation: 
1 0 0 
Fy= -1 -1 1 
0 1 O 


function y = D2(x) 
y = zeros(3,1); 


y(1) = x(4); 
y(2) = x(2); 
y(3) = x(1) + x(2); 


function y = D2t(x) 
y = zeros(2,1); 
y(1) = x(1)+x(3); 
y(2) = x(2)+x(3); 


function y = F2t(x) 
y = zeros(3,1); 


y(1) = x(1)-x(2); 
y(2) = -x(2)+x(3); 
y(3) = x(2); 


3 point linear convolution 


D3 can be implemented in 7 additions, Ds in 9 additions. 
Equation: 


1 0 0 

1 1 1 

D3= 1 -1 1 

1 2 4 

0 0 1 

Equation: 

6 0 O 0 
-3 6 -2 -Il1 
F3 = = -6 3 3 O 
3-3 -1 1 
0 0 O 0 


function y = D3(x) 


y = zeros(5,1); 

a = x(2)+x(3); 

b = x(3)-x(2); 
y(1) = x(1); 

y(2) = x(1)+a; 
y(3) = x(1)+b; 
y(4) = atatbty(2),; 
y(5) = x(3); 


function y = D3t(x) 

y = zeros(3,1); 

y(1) = x(2)+x(3)+x(4); 
a = x(4)+x(4); 


y(2) = x(2)-x(3)+a; 
y(3) = y(1)+x(4)+a; 
y(1) = y(1)+x(1); 
y(3) = y(3)+x(5); 


function y = F3t(x) 
y = zeros(5,1); 


y(1) = 6*x(1)-3*x(2) -6*x(3)+3*x(4); 
y(2) = 6*x(2)+3*x(3)-3*x(4); 

y(3) = -2*x(2)+3*x(3)-x(4); 

y(4) = -x(2)+x(4); 

y(5) = 12*x(2)-6*x(3)-12*x(4)+6*x(5); 
y = y/6; 


4 point linear convolution 


Equation: 


Dy= D2, @D»2 
Equation: 
1 0 0 0 0 0 0 
-1 -1 1 0 0 0 0 
-1 1 0 -1 0 0 1 
Fg= 1 1 -1 1 =21 =~-1 -1 
0 -1l 0 1 -1 0 =O 
0 O O -1 -1 1 O 
0 0 0 0 1 0 0 


function y = F4t(x) 
y = zeros(7,1); 


y(1) = x(1)-x(2)-x(3)+x(4); 
y(2) = -x(2)+x(3)+x(4)-x(5); 
y(3) = x(2)-x(4); 

y(4) = -x(3)+x(4)+x(5)-x(6); 
y(5) = x(4)-x(5)-x(6)+x(7); 
y(6) = -x(4)+x(6); 

y(7) = x(3)-x(4); 

y(8) = -x(4)+x(5); 

y(9) = x(4); 


6 point linear convolution 


Equation: 
Dg = D2 ® Dz 
Equation: 
6 0 O 0 0 0 0 O O 0 
-3 6 -2 -1 12 0 0 0 0 0 
-6 3 3 0 -6 0 0 0 0 0 
—-3 -3 -1 1 -12 -6 0 0 0 0 
3 -6 2 1 -6 3 -6 2 1 -12 
Fe = : 6 3-3 6 6 -3 -3 0 6 
-3 3 1 -1 12 3 3 #1 -1 12 
0 0 0 O -6 -3 6 -2 -1 6 
0 0 O 0 0 -6 3 3 0. -6 
0 0 O 0 0 3-3 -1 1 -12 
0 0 O 0 0 0 0 O O 6 


(= ea = Ta ee 


wnnondda eo 


ray 


function y = F6t(x) 
y = zeros(15,1); 


y(1) = 6*x(1)-3*x(2)-6*x(3)-3*x(4)+3*x(5)+6*x(6)-3*x(7); 
y(2) = 6*x(2)+3*x(3)-3*x(4) -6*x(5)-3*x(6)+3*x(7); 

y(3) = -2*x(2)+3*x(3)-x(4)+2*x(5)-3*x(6)+x(7); 

y(4) = -x(2)+x(4)+x(5)-x(7); 

y(5) = 12*x(2)-6*x(3)-12*x(4) -6*x(5)+6*x(6)+12*x(7)-6*x(8); 
y(6) = -6*x(4)+3*x(5)+6*x(6)+3*x(7)-3*x(8)-6*x(9)+3*x(10); 
y(7) = -6*x(5)-3*x(6)+3*x(7)+6*x(8)+3*x(9)-3*x(10); 

y(8) = 2*x(5)-3*x(6)+x(7)-2*x(8)+3*x(9)-x(10); 

y(9) = x(5)-x(7)-x(8)+x(10); 

y(10) = -12*x(5)+6*x(6)+12*x(7)+6*x(8) -6*x(9)-12*x(10)+6*x(11); 
y(11) = 6*x(4)-3*x(5)-6*x(6)+3*x(7); 

y(12) = 6*x(5)+3*x(6)-3*x(7); 

y(13) = -2*x(5)+3*x(6)-x(7); 

y(14) = -x(5)+x(7); 

y(15) = 12*x(5)-6*x(6)-12*x(7)+6*x(8); 

y = y/6; 


8 point linear convolution 


Equation: 
Dg = Dz ® D2 ® Dz 
Equation: 
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-1 -1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-1 1 0 -1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 
1 21 -1 1 1 =-1 -1 -1 1 £0 0 0 0 0 0 0 0 0 0 
-1 -1 0 1 -1 0 0 1 +0 -1 0 0 0 0 0 0 0 0 41 
1 -1 -1 -1 1 0 0 0 1 21 -1 0 0 0 0 0 0 -! 
1 -1 0 1 1 +0 -!1 0 1 -1 0 1 0 0 -1 0 0 -!I 
Fg= -1 -1 1 -1 -1 1 #21 =#21 =-1 -1 -1 «212 ~=+-1 ~-1«2~ «2 ~=+%721 =-1 «21 
0 1 0 -1 1 0 0 -1 0 21 1 0 -1 1 40 0 -1 0 = =0 
0 0 O 1 1 -1 0 0 0 -1 -1 1 #1 1 --1 0 0 0 40 
0 0 O0O 0 -!1 0 0 0 0 -1 1 0 -1 -1 0 1 +40 0 40 
0 0 0 0 0 0 0 0 0 21 1 -1 1 1 = -1 -1 -1 1 =O 
0 0 0 0 0 0 0 0 0 0 -1 0 1 -1 0 0 21 +0 0 
0 0 0 0 0 0 0 0 0 0 0 -1 -1 1 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 


function y = F8t(x) 

y = zeros(27,1); 

y(1) = x(1)-x(2)-x(3)+x(4) -x(5)+x(6)+x(7)-x(8); 
y(2) = -x(2)+x(3)+x(4)-x(5)+x(6)-x(7)-x(8)+x(9); 
y(3) = x(2)-x(4)-x(6)+x(8); 


oooco.oO Oo 
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y(4) = -x(3)+x(4)+x(5)-x(6)+x(7)-x(8)-x(9)+x(10) ; 


y(5) = x(4)-x(5)-x(6)+x(7)-x(8)+x(9)+x(10)-x(11); 
y(6) = -x(4)+x(6)+x(8)-x(10); 

y(7) = x(3)-x(4)-x(7)+x(8); 

y(8) = -x(4)+x(5)+x(8)-x(9); 

y(9) = x(4)-x(8); 

y(10) = -x(5)+x(6)+x(7)-x(8)+x(9)-x(10)-x(11)+x(12); 
y(11) = x(6)-x(7)-x(8)+x(9)-x(10)+x(11)+x(12)-x(13); 
y(12) = -x(6)+x(8)+x(10)-x(12); 

y(13) = x(7)-x(8)-x(9)+x(10) -x(11)+x(12)+x(13)-x(14); 
y(14) = -x(8)+x(9)+x(10)-x(11)+x(12)-x(13)-x(14)+x(15); 
y(15) = x(8)-x(10)-x(12)+x(14); 

y(16) = -x(7)+x(8)+x(11)-x(12); 

y(17) = x(8)-x(9)-x(12)+x(13); 

y(18) = -x(8)+x(12); 

y(19) = x(5)-x(6)-x(7)+x(8); 

y(20) = -x(6)+x(7)+x(8)-x(9); 

y(21) = x(6)-x(8); 

y(22) = -x(7)+x(8)+x(9)-x(10); 

y(23) = x(8)-x(9)-x(10)+x(11); 

y(24) = -x(8)+x(10); 

y(25) = x(7)-x(8); 

y(26) = -x(8)+x(9); 

y(27) = x(8); 


18 point linear convolution 
Equation: 


Dg = D2 ® D3 ® D3 


Fg and the program F18t are too big to print. 


Appendix: A 45 Point Circular Convolution Program 


Appendix: A 45 Point Circular Convolution Program 


As an example, we list a 45 point circular convolution program. 


function y = cconv45(x,u) 
% y = ccconv45(x,u) 
*% Y 


the 45 point circular convolution of x and h 


% where u is a vector of precomputed 


multiplicative constants 


x = pfp([9,5],2,x); 
permuation 

x = KRED([3,5],[2,1],2,x); 
Operations (152 Additions) 
y = zeros(45,1); 


v = ID2I(1,1,x(2:3)); 
Kron I(1)) * x(2:3) 

v = v.*u(2:4); 

y(2:3) = ID2tI(1,1,v); 
kron D2' kron I(1)) * v 
Oo See eo Soe es soe oe 


v = ID3I(2,1,x(4:9)); 
Kron I(1)) * x(4:9) 

v = ID2I(1,5,v); 

kron I(5)) * v 

v = v.*u(5:19); 

v = ID2tI(1,5,v); 

kron I(5)) * v 

y(4:9) = ID3tI(2,1,v); 


% prime factor 


% reduction 


% v = (I(1) kron D2 

a: 1=1*1 

% 3 Multiplications 

% y(2:3) = (1(1) 
a: 2=1*2 


% Vv = (1(2) kron D3 

a: 14=2*7 

% v = (I(1) kron D2 

a. 2 .6=5°4 

% 15 Multiplications 

% v = (I(1) kron D2' 
10=5*2 

% y(4:9) = (1(2) 


kron D3' kron I(1)) * v 
% 
V ID21I(1,2,x(10:13) ); 
Kron I(2)) * x(10:13) 

V ID21I(3,1,v); 

Kron I(1)) * v 

v = v.*u(20:28); 

V ID2tI(1,3,v); 

kron I(3)) * v 

y(10:13) = ID2tI(2,1,v); 


a 


kron D2' kron I(1)) * v a 
OP tits ey Bee erin ae Se tenet block 15 
v = ID21(1,4,x(14:21)); % NV 
kron I(4)) * x(14:21) a: 
v = ID2I(3,2,v); % V 
kron I(2)) * v a: 
v = ID2I(9,1,v); % V 
kron I(1)) * v a: 


v = v.*u(29:55); 

V ID2tI(1,9,v); 

Kron I(9)) * v 

V ID2tI(2,3,v); 

kron I(3)) * v 

y(14:21) = ID2tI(4,1,v); 
kron D2' kron I(1)) * v 
% 
V ID31I(2,4,x(22:45) ); 
kron I(4)) * x(22:45) 

V ID21(1,20,v); 

Kron I(20)) * v 

V ID21(15,2,v); 

kron I(2)) * v 

V ID21(45,1,v); 

Kron I(1)) * v 


a 


a 


V 


V 


V 


Vv 


(I(1) kron D2 
=2*1 

(I(3) kron D2 
3=3*1 
Multiplications 
= (I(1) kron D2' 
6=3%*2 


% y(10:13) = (I(2) 


4=2*2 


(I(1) kron D2 
=4*1 

(I(3) kron D2 
=6*1 

(I(9) kron D2 
9=9*1 


lol Bp Il 


% 27 Multiplications 
% Vv = (I(1) kron D2' 


18=9*2 


% Vv = (1(2) kron D2' 


12=6*2 


% y(14:21) = (1(4) 
a: 


8=4*2 


% Vv = (1(2) kron D3 


56=8*7 

= (I(1) kron D2 

20=20*1 

= (1(15) kron D2 
30=30*1 

= (1(45) kron D2 
45=45*1 


v = v.*u(56:190); % 135 


Multiplications 

v = ID2tI(1,45,v); % v = (I(1) kron D2' 
kron I(45)) * v a: 90=45%*2 

v = ID2tI(10,3,v); % v = (1(10) kron 
D2' kron I(3)) * v a: 60=30%*2 

v = ID2tI(20,1,v); % v = (1(20) kron 
D2' kron I(1)) * v a : 40=20%*2 
y(22:45) = ID3tI(2,4,v); % y(22:45) = (1I(2) 
kron D3' kron I(4)) * v a: 72=8%*9 

y = tKRED([3,5],[2,1],2,y); % transpose 
reduction operations (152 Additions) 

y = pfpt([9,5],2,y); % prime factor 
permuation 


y = y(45:-1:1); 


% TOtal Number of Multiplications : 190 
% Total Number of Additions: 839 


Appendix: A 31 Point FFT Program 


Appendix: A 31 Point FFT Program 
As an example, we list a 31 point FFT program. 


function y = fft31(x,u,1ip,op) 

% y = fft31(x,u,1p, op) 

% y : the 31 point DFT of x 

% U : a vector of precomputed multiplicative 
constants 

% 1p : input permutation 

% Op : Ouput permutation 


zeros(31,1); 

x(ip); 

% AInput permutation 

X(2:31) = KRED([2,3,5],[1,1,1],3,x(2:31)); 
% reduction operations 

y(1) = x(1)+x(2); 

% DC term calculation 


< 
Wot 


% -------- 7-H --- block : 1 ----------------- 


v = ID2I(1,1,x(4:5)); % v = (I1(1) 
Kron D2 kron I(1)) * x(4:5) 

v = v.*u(3:5); 

y(4:5) = ID2tI(1,1,v); % y(4:5) = 
(I(1) kron D2' kron I(1)) * v 


% ----- oer er ener eee block : 6 = 2 * 3 --------- 


v = ID2I(1,1,x(6:7)); %v = (I(1) 


kron D2 kron I(1)) * x(6:7) 
v = v.*u(6:8); 


y(6:7) = ID2tI(1,1,v); % y(6:7) = 
(I(1) kron D2' kron I(1)) * v 

% --------- err eee block : 5 ----------------- 
v = ID2I(1,2,x(8:11)); SV = (IC) 
Kron D2 kron I(2)) * x(8:11) 

v = ID21I(3,1,v); % v= (1(3) 
kron D2 kron I(1)) * v 

v = v.*u(9:17); 

v = ID2tI(1,3,v); % v= (I(1) 
kron D2' kron I(3)) * v 

y(8:11) = ID2tI(2,1,v); % y(8:11) = 
(I(2) kron D2' kron I(1)) * v 

% ------- 2-7 -r---- eee block : 10 = 2 * 5 -------- 
v = ID2I(1,2,x(12:15)); %v = (I1(1) 
kron D2 kron I(2)) * x(12:15) 

v = ID21I(3,1,v); % v= (1I(3) 
Kron D2 kron I(1)) * v 

v = v.*u(18:26); 

v = ID2tI(1,3,v); % v = (I(1) 
kron D2' kron I(3)) * v 

y(12:15) = ID2tI(2,1,v); % y(12:15) = 
(I(2) kron D2' kron I(1)) * v 

% ------- oer eee block :15 = 3 %* 5 -------- 
v = ID2I(1,4,x(16:23)); % v = (I1(1) 
kron D2 kron I(4)) * x(16:23) 

v = ID21I(3,2,v); % Vv = (1(3) 
kron D2 kron I(2)) * v 

v = ID21I(9,1,v); %v = (1I(9) 
kron D2 kron I(1)) * v 

v = v.*u(27:53); 

v = ID2tI(1,9,v); % v= (I(1) 


kron D2' kron I(9)) * v 


v = ID2tI(2,3,v); % v= (1I(2) 
kron D2' kron I(3)) * v 


y(16:23) = ID2tI(4,1,Vv); % y(16:23) = 
(I(4) kron D2' kron I(1)) * v 

% ------- ore - eee block : 30 =2 * 3% 5 ---- 
v = ID2I(1,4,x(24:31)); %v = (I1(1) 
kron D2 kron I(4)) * x(24:31) 

v = ID21I(3,2,v); % v = (1(3) 
kron D2 kron I(2)) * v 

v = ID2I(9,1,v); % v= (1I(9) 


kron D2 kron I(1)) * v 
v = v.*u(54:80); 


v = ID2tI(1,9,v); % v= (I(1) 
kron D2' kron I(9)) * v 

v = ID2tI(2,3,v); % Vv = (I(2) 
kron D2' kron I(3)) * v 

y(24:31) = ID2tI(4,1,v); % y(24:31) = 


(I(4) kron D2' kron I(1)) * v 
OF cole yO oN, ORS a SE a Rh PANY RD hl OS TS pepe, Re ie aa 


y(2) = y(1)ty(2); 

% DC term calculation 

y(2:31) = tKRED([2,3,5],[1,1,1],3,y(2:31)); 
% transpose reduction operations 

y = y(op); . 

% Output permutation 


% For complex data - 
% Total Number of Real Multiplications : 160 
% Total Number of Real Additions: 776 


The constants, input and output permutations are: 


% The multiplicative constants for the 31 point 
FFT 


sqrt(- 


a lee 
.185592145427667*I 
.251026872929094 
.638094290379888 
.296373721102994 
.462201919825109*I 
.155909426230360*I 
.102097497864916*I 
.100498239164838 
».217421331841463 
.325082164955763 
. 198589508696894 
. /80994042074251 
.256086011899669 
.169494392220932 
. /11997889018157 
.060064820876732 
.235197570427205*1I 
.271691369288525*1I 
.541789612349592*1T 
.329410560797314*I 
.317497505049809*I 
.5995088038583817*1I 
.093899154219231*I 
.176199088841836*I 
.028003825226279*I 
.316699050305790 
. 3303152 70540553 
.385122753006171 
.958666546021397 
.935301995146201 
.013474028487015 
.081897731187396 
.136705213653014 


1); 


033333333333333 


.969390844064251 
.262247009112805 
0098555 70455675 
.159348599757857 
.62936 7699727360 
»229312102919654 
.4798746 70425178 
.058279061554516 
. 908 786032252333 
»/21257672797977 
.351484013730995 
.113390280332076 
.914823784254676 
. 176432948 764679 
.435329964075516 
.177866452687279 
. 341206223210960 
.25/360272866440 
.050622276244575 
. 145673340229639*I 
.68517 7424507523*1 
.880463026400118*I 
.028851220636894*I 
.345528375980267*I 
.463210769729252*1I 
. 3284210835587 74*I 
.237219367348867*I 
.086975102467855*I 
.665522956385442*1 
. 628826188810638*I 
.534088072762272*1I 
.050496586573981*I 
.209597199290132*I 
.887582325001072*I 
.019017208624242*T 
.143897052948668*I 
.659358110687783*I 


1.470398765538361*I 
-1.438001204439387*I 
-@.471517033054130*I 

2 .693115935736959*I 

0.185041858423467*I 
-0.783597698243441*T 
-1.782479430727672*I 

0.127038806765845*I 

0.582111071051880*I 


1; 


% The input permutation for the 31 point FFT 


[| 


ip 


WOORNE il 
a | 


He 


% The output permutation for the 31 point FFT 


1; 


Appendix: Matlab Functions For Circular Convolution and Prime Length 
FFTs 


Programs for Reduction Operations 


The reduction matrix of Equation 44 in Preliminaries is implemented by 
KRED which calls RED . Its transpose and inverse transpose are 
implemented by tRED , tRED , 1tKRED and 1tRED. 


function x = KRED(P,E,K,xX) 
% X = KRED(P,E,K,xX);} 

% P : P = [P(1),...,P(K)]; 
% E : E = [E(K),...,E(K)]; 


for i 1:K 
a = prod(P(1:1-1).4E(1:1-1)); 
C = prod(P(iti:K).4E(i+1:K)); 
p = P(1); 
e = E(1i); 
for 3} =-€=1;-170 


X(L:a*c*(p4(j+1))) = RED(p,a,c* 
(prj), x(Lia*c*(p%(j+1)))); 
end 
end 


function y = RED(p,a,Cc, x) 
% y = RED(p,a,c,x); 
y = zeros(a*c*p,1); 
for 1 = O:c:(a-1)*c 
for j = O:c-1 
y(itjt1) = x(i*ptjti); 
for k = O:c:c*(p-2) 
y(itjt+l) = y(itjt1) + x(i*ptj+k+c+1); 
y(i*(p-1)+j+kta*c+1) = x(i*ptjtkt1) - 
X(i*ptjt+c*(p-1)+1); 
end 
end 
end 


function x = tKRED(P,E,K, x) 
% X = tKRED(P,E,K,xX); 

% (transpose) 

% P : P = [P(1),...,P(K)]; 
% E : E = [E(K),...,E(K)]; 
for 1 = K:-1:1 


a = prod(P(1:i-1).4E(1:i-1)); 
C = prod(P(iti:K).4E(i+1:K)); 
p = P(i); 

e = E(1i); 

T 


or j = O:e-1 
X(L:a*c*(p*(j+1))) = tRED(p,a,c* 
(p4j),x(1ia*c*(pr(j+1)))); 
end 
end 


function y = tRED(p,a,c,x) 
% y = tRED(p,a,c,Xx); 
% (transpose) 
y = zeros(a*c*p,1); 
for 1 = O:c:(a-1)*c 
for j = O:c-1 
y(i*ptjtc*(p-1)+1) = x(itjt1); 
for k = O:c:c*(p-2) 
y(i*ptjtk+1) = x(itjt1) + x(i*(p- 
1)+j+kt+a*c+1); 
y(i*ptjtc*(p-1)+1) = y(i*ptjte*(p-1)+1) - 
X(1*(p-1)+j+kta*c+1); 
end 
end 
end 


Programs for I ® Dk @I 


The operations of I, ® Dz ® I, and Im ® D3 ® I, are carried out by 
ID2I and ID31 . Their transposes by ID2tI and ID3tI . The functions 
D2 and D3 are listed in the appendix, “Bilinear Forms for Linear 


Convolution.' Two versions of ID2T are listed here. One of them calls D2 
in a loop, while the other version puts the D2 code in the loop instead of 
using a function call. There are several ways to implement the form 

I ® Dz ® I. But this is a simple and straightforward method. It is modeled 
after IAT in the text. 


function y = ID2I(m,n,x) 


‘ni2°n- 


for j = O:n-1 

y(vti*3*n+j+1) = D2(x(uti*2*nt+j+1) ); 
end 
end 


function y = ID2I(m,n,x) 
y = zeros(m*n*3,1); 
for i O:n:n*(m-1) 
12 2 
13 oa 
for j 1:n 
j2 12 +)? 
13 i3 + j; 
y(j3) = x(j2); 
y(n+j3) = x(nt+j2); 
y(2*n+j3) = x(j2) + x(nt+j2); 
end 
end 


function y = ID2tI(m,n, x) 
zeros(m*n*2,1); 


< 
nou 

fo) 
55 

= 


n 
for 1 = O:m 
for j = O:n-1 

y(vti*2*n+j+1) = D2t(x(uti*3*n+j+1)); 


end 
end 


function y = ID3I(m,n,x) 
zeros(m*n*5,1); 
O:n:4*n; 


< 
Wot ol 


y(vti*5*n+j+1) = D3(x(uti*3*nt+j+1) ); 
end 
end 


function y = ID3tI(m,n, x) 


y = zeros(m*n*3,1); 
Vv = O:n:2*n; 
U = O:n:4*n; 
for 1 = O:m-1 
for j = O:n-1 


y(vti*3*n+j+1) = D3t(x(uti*5*n+j+1)); 
end 


Appendix: A Matlab Program for Generating Prime Length FFT Programs 


function [u,ip,op,ADDS,MULTS]| = ff(p,e); 
% [Uu,1p,op,ADDS,MULTS] = ff(p,e); 
% U : multiplicative constants 
% 1p : input permutation 
% op : output permutation 


K = length(p); 
N = prod(p.‘e); 
PSN 2s 


[pr, ipr] = primitive_root(P); 
Red_Adds = 2 * N * (K - sum(1./(p.4e)) ); 
ADDS = 2 * Red_Adds; 


FS = sprintf('fft%d.m',P); 
fid = fopen(FS, 'w'); 


fprintf(fid, 'function y = fft%d(x,u,ip,op)\n',P); 
fprintf(fid, '%% y = fft%d(x,u,ip,op)\n',P); 
fprintf(fid, '%% y : the %d point DFT of x \n',P); 
fprintf(fid, '%% u : a vector of precomputed 
multiplicative constants\n'); 

fprintf(fid, '%% ip : input permutation\n'); 
fprintf(fid, '%% op : ouput permutation\n'); 


Pstr = sprintf('[%d',p(1)); 

for k = 2:K, Pstr = [Pstr, sprintf(',%d',p(k))]; 
end 

Pstr [Pstr,']']; 

Estr sprintf('[%d',e(1)); 

for k = 2:K, Estr = [Estr, sprintf(',%d',e(k))]; 
end 

Estr = [Estr,']']; 

PEstr sprintf('[%d',p(1)4e(1)); 

for k 2:K, PEstr = [PEstr, 


sprintf(',%d',p(k)4e(k))]; end 
PEstr = [PEstr,']']; 


fprintf(fid, '\n'); 

S = sprintf('y = zeros(%d,1);\n',P); 
fprintf(fid,S); 

S1 = sprintf('x = x(ip);'); 

S2 = sprintf('%% input permutation\n' ); 
fprintf(fid, '%-50s%s',S1,S2); 

S1 = sprintf(['x(2:%d) = 
KRED(',Pstr,',',Estr,',%d,x(2:%d));'],P,K,P); 
S2 = sprintf('%% reduction operations\n' ); 
fprintf(fid, '%-50s%s',S1,S2); 


e_ table = [0:e(1)]'; 
a = e(1)+1; 
for 1 = 2:K 
e_table = [kron(ones(e(1i)+1,1),e_table), 
kron([0:e(1)]',ones(a,1))]; 
a=a* (e(1)+1); 
end 
R = prod(et+1); 


RRS a pie he gig TER MULTIPLICATIVE 
CONSTANTS --------------- 3 eee 

k = rp(P,ipr,0:N); 

I = sqrt(-1); 

W = exp(-1I*2*pi*k/P); 

h = W(2:P); 

h = h(N:-1:1); 

h = pfp(p.%e,K,h); 

h = itKRED(p,e,K,h); 

u = h(1); 

S1 = sprintf('y(1) = x(1)+x(2);'); 

S2 = sprintf('%% DC term calculation\n' ); 


fprintf(fid, '%-50s%s',S1,S2); 


DC_ADDS = 2; 
ADDS = ADDS + DC_ADDS; 


SEINE 2 “wee seseeerheoes See G neler te ete he cls a oie 
SB = ' block a ea 

SC = SLINE; 

BL = 21; 


SC(BL:BL-1+length(SB)) = SB; 
fprintf(fid, '%% %s\n',SC); 

S1 = sprintf('y(2) = x(2)*u(1);'); 
fprintf(fid, '%-40s\n',S1); 

a= 1: 


MULTS = 1; 
for 1 = 2:R 
v = e_table(i,:); 
f = find(v>0); 
q = p(t); 
BS Von), 
L = prod(q-1)*prod(q.4(t-1)); 


B = prod(q.t); 

bs = sprintf('%d',q(1)t(1)); 

for k = 2:length(q), bs = [bs, sprintf(' * 

*d',Q(k)At(k))]; end 

if length(q) > 1 
SB = sprintf(' block : %d = %s ',B,bs); 
SC = SLINE; 
SC(BL:BL-1it+length(SB)) = SB; 
fprintf(fid, '%% %s\n',SC); 

else 
SB = sprintf(' block : %d ',B); 
SC = SLINE; 
SC(BL:BL-1it+length(SB)) = SB; 
fprintf(fid, '%% %s\n',SC); 


end 
if prod(q.4t) == 2 
S1 = sprintf('y(%d) = 
x(%d)*u(%d);',at2,a+2,MULTS+1) ; 
fprintf(fid, '%-40s\n',S1); 
Mk = 
eae? 
Sedge = ie =) Ol. BLS. By 
a j= 1: length(q) 
[dk, rk, ck,Qk,Qtk] = A_data(q(j)4t(j)); 
if dk > 1 
d = [d dk]; r = [r rk]; c = [c ck]; Q 
= [Q Qk]; Qt = [Qt Qtk]; 
end 
end 
[g,C1] = cgc(Q,r,c,length(Q) ); 
ADDS = ADDS + C1; 
= prod(r); 
BEG = int2str(a+2); FIN = int2str(ati+L); 
XX = ['x(',BEG,':',FIN,')']; YY = 'v'; 
kpi(d,g,r,c,length(Q),YY, XX, fid); 
S1 = ['v = 


v.*u(',int2str(MULTS+1), ':',int2str(MULTS+Mk),');' 


fprintf(fid, '%-40s\n',S1); 

[g,C2] = cgc(Qt,c,r,length(Q)); 

ADDS = ADDS + C2; 

XX = 'v'; YY = ['y(',BEG,':',FIN,')']; 
kpit(d,g,c,r,length(Q), YY, XX, fid); 


end 

c=[],; 

r= [1]; 

lq = length(q); 
for j = 1i:lq 


a rk,ck] = C_data(q(j),t(j)); 
[r rk]; c = [c ck]; 


end 
P= (O-d)e (Oe8 (t=); 
temp = Kcrot(q,t,1lq,h(at1i:a+L)); 
temp = KFt(f,r,c,temp); 
u = [u; temp(:)]; 
a=art+l; 
MULTS = MULTS + Mk; 
end 


u(1) = u(1)-1; 

fprintf(fid, '%% %s\n',SLINE); 

S1 = sprintf('y(2) = y(1)+y(2);"'); 

S2 = sprintf('%% DC term calculation\n'); 
fprintf(fid, '%-50s%s',S1,S2); 

S1 = sprintf(['y(2:%d) = 
tKRED(',Pstr,',',Estr,',%d,y(2:%d));'],P,K,P); 
S2 = sprintf('%% transpose reduction 
Operations\n' ); 

fprintf(fid, '%-50s%s',S1,S2); 

S1 = sprintf('y = y(op);'); 

S2 = sprintf('%% output permutation\n' ); 
fprintf(fid, '%-50s%s',S1,S2); 
fprintf(fid, '\n'); 


MULTS = 2 * MULTS; 

ADDS = 2* ADDS; 

fprintf(fid, '%% For complex data - \n'); 
fprintf(fid, '%% Total Number of Real 
Multiplications : %d\n',MULTS); 
fprintf(fid, '%% Total Number of Real Additions: 
%d\n\n',ADDS) ; 
fclose(fid); 


LLLKUKK“KLVLLLHKUNUHW%U%% CO M P U T E I N Pp U T A N D 0) U af Pp U T 


P E R M U TAT I O N S ALO, OWL TLSL SATA TATA SA SATA SLL SLT TA SA SASL SASASLSLILS 


id = 1:P; % identity permutation 


ip = rp(P,pr,id), 
ip(2:P) = pfp(p.%e,K,ip(2:P)); 


op = id; 

op(2:P) = pfpt(p.%e,K,op(2:P) ); 
op(2:P) = op(P:-1:2); 

op = rpt(P,ipr,op); 


AHHH PUT MULTIPLICATIVE CONSTANTS AND 
PERMUTATIONS IN A FILE %%%%%%%%%2%%2%76% 


CFS = sprintf('cap%d.m',P); 
fid = fopen(CFS, 'w'); 
fprintf(fid, '\n%% The multiplicative constants for 
the %d point FFT\n\n',P); 
fprintf(fid, 'I = sqrt(-1);\n'); 
fprintf(fid, 'u = [\n'); 
for k = 1:MULTS/2 
if abs(real(u(k))) < 0.000001 
fprintf(fid, '%25.15f*I\n',imag(u(k))); 
elseif abs(imag(u(k))) < 0.00001 
fprintf(fid, '%25.15f\n', real(u(k))); 
else 
fprintf(fid, '%25.15f + 
%25.15f*I\n', real(u(k)),imag(u(k))); 
end 
end 
fprintf(fid,'];\n\n'); 
fprintf(fid, '\n%% The input permutation for the %d 
point as nie 
Aas = [\n'); 
for k = 1: 
Se %d\n',ip(k)); 
end 
fprintf(fid,'];\n\n'); 
fprintf(fid, '\n%% The output permutation for the 
%d point FFT\n\n',P); 


fprintf(fid, 'op = [\n'); 
for k = 1:P 
fprintf(fid, ' %d\n',op(k)); 
end 
fprintf(fid,'];\n\n'); 
fclose(fid); 


The following programs print the program statements that carry out the 
operation J ®@ D;, ® J andI ® D', ® I. They are modeled after kpi in the 


text. 


function kpi(d,g,r,c,n,Y,X,f1d) 

% kpi(d,g,r,c,n,Y,X,fid); 

% Kronecker Product : A(d(1)) kron ... kron 
A(d(n)) 

% g : permutation of 1,...,n 

%r i [r(1),-.-,r(n)] 

% Cc : [c(1),..,c(n)] 

% r(1) : rows of A(d(1i)) 

% C(1) : columns of A(d(i)) 

% n : number of terms 


1:(g(1)-1) 
find(g==k) 
a * r(k); 


9 F Il 
ll V 


te) 
I 


=a * c(k); 


i > find(g==k) 
b = b * r(k); 
e 


b * c(k); 


end 
% Y = (I(a) kron A(d(g(i))) kron I(b)) * X; 
if i1==1 


S1 = sprintf([Y,' = ID%dI(%d, %d X,'); 
'],d(g(1)),a,b); 
S2 = sprintf(['%% ',Y,' = (1(%d) kron D%d 


kron I(%d)) * ',X],a,d(g(1)),b); 
Fprintf(fid, '%-35s%s\n',S1,S2); 
elseif d(g(1)) ~= 


S1 = sprintf([Y,' = ID%dI(%d, %d b aoa ee 
'],d(g(1)),a,b); 
S2 = sprintf(['%% ',Y,' = (1(%d) kron D%d 


kron I(%d)) * ',Y],a,d(g(i)),b); 
fprintf(fid, '%-35s%s\n',S1,S2); 
end 
end 


function kpit(d,g,r,c,n,Y,X,fid) 
% kpit(g,r,c,n,Y,X,fi1d); 
% (transpose) 


% Kronecker Product : A(d(1))' kron ... kron 
A(d(n))' 

% g : permutation of 1,...,n 

%r i [r(1),-.-,r(n)] 


% C [c(1),..,¢(n)] 

% r(1) : rows of A(d(i))' 

% C(1) : columns of A(d(i))' 
% n : number of terms 


1:(g(1)-1) 
find(g==k) 
a * r(k); 


9 F- Il 
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a * c(k); 


b= 1; 
for k = (g(1i)+1):in 
if 1 > find(g==k) 
b = b * r(k); 
else 
= b * c(k); 
end 
end 
% X = (I(a) kron A(d(g(1)))'' kron I(b)) * x 
if i ==n 
S1 = sprintf([Y,' = ID%dtI(%d,%d,',X,'); 
'],d(g(1)),a,b); 
S2 = sprintf(['%% ',Y,' = (1(%d) kron D%d'' 


kron I(%d)) * ',X],a,d(g(1)),b); 
Fprintf(fid, '%-35s%s\n',S1,S2); 
elseif d(g(1)) ~= 


S1 = sprintf([X,' = ID%dtI(%d, %d > raat Be 
'],d(g(1)),a,b); 
S2 = sprintf(['%% ',X,' = (1(%d) kron D%d'' 


kron I(%d)) * ',X],a,d(g(i)),b); 
fprintf(fid, '%-35s%s\n',S1,S2); 
end 
end 


Programs for Computing Multiplicative Constants 


The following programs carry out the operation of Fy, ®-:: ® Fa, where 
F is the reconstruction matrix in a linear convolution algorithm. See the 
appendix, “Bilinear Forms for Linear Convolution.’ 


function u = KFt(f,r,c,u) 

% U = (FAt kron ... kron FAt)*u 
% (transpose) 

% f = [f(1),..-,F(K)] 

% ri: (lL) rows of F(1) 

Me C3 GEL) columns of F(1) 


% u : Length(u) = prod(c); 
K = length(f); 
for 1 = 1:K 


m = prod(c(1:1i-1)); 
n = prod(r(iti:K)); 
u = IFtI(f(1),r(1),c(1i),m,n,u); 


end 


function y = IFtI(s,r,c,m,n,x); 
% y = (I(m) kron F(s)4t kron I(n))*x 
% (transpose) 
% r : rows of F(S) 
% C : columns of F(s) 
= O:n:n*(c-1); 
= O:n:in*(r-1); 
= O:m-1 
r j = O:n-1 
y(vti*c*n+j+1) = Ftop(s, x(uti*r*n+j+1)); 
end 
end 


0 
0 
1 
O 


f 


function y = Ftop(k,x) 
if k == 1, y =x; 


elseif k == 2, y = F2t(x); 
elseif k == 3, y = F3t(x); 
elseif k == 4, y = F4t(x); 
elseif k == 6, y = F6t(x); 
elseif k == 8, y = F8t(x); 
elseif k == 18, y = F1i8t(x); 
end 


The following programs carry out the operation of Gg ®---®@ Gee were 


G is given by Equation 13 and Equation 14 from Bilinear Forms for 
Circular Convolution. 


function x = Kcrot(p,e,K, x) 
% Kronecker product of Cyclotomic Reduction 


Operations. 
% X = (G(p(1)4e(1)) kron ... kron G(p(K)4(K) ))At*x 
% (transpose) 


%p : p = [p(1),-.+,P(K)]; 
%e : e = [e(1),...,e(K)];, 
a = (p- 1).*((p).*(e-1)); 
r =a; % y(1) = number of rows of G(1) 
Cc = 2*a-1; % C(1) = number of columns of G(1i) 
mM. = a> 
n = prod(r); 
for 1 = 1:K 
n=n/ r(i); 
xX = IcrotI(p(1),e(1),m,n,x); 
m=m * c(1); 
end 


function y = IcrotIi(p,e,m,n,x) 
% Yy = (eye(m) Kron G(p%e)4t kron eye(n))*x 
% (transpose) 
(p-1)*(pr(e-1)); 
a; 
2*a-1; 
zeros(r*m*n,1); 
O:n:(r-1)*n; 
O:n:(c-1)*n; 
or 1 = O:m-1 
for j = O:n-1 
y(vti*r*n+jt+1) = crot(p,e,x(uti*c*n+j+1)); 
end 
end 
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function y = crot(p,e,x) 

% y = crot(p,x) 

% cyclotomic reduction matrix (transpose) 
% length(x) == 2*n-1 

% length(y) == n 

% where n = (p-1)*(p4(e-1)) 


n = (p-1)*(p(e-1)); 
y = zeros(2*n-1,1); 
if p == 
n = pA(e-1); 
y(iin) = x; 
y(nt+1:2*n-1) = -x(1:n-1); 
else 
y(iin) = x; 


L = pA(e-1); 

y(n+1i:n+L) = -x(1:L); 

a scl; 

for k = 2:p-1 
y(nti:n+L) = y(nt+i:in+L) - x(ati:at+L); 
a=at+l; 

end 

b = 2*n-1 - p*(pr(e-1)); 

y(p*L+1:p*Lt+b) = x(1:b); 

end 


The following programs tell the programs for code generation relevant 
information about the bilinear forms for cyclotomic convolution. 
Specifically, they indicates the linear convolution out of which these 
cyclotomic convolution are composed, and the dimensions of the 
corresponding matrices. See the appendix Bilinear Forms for Linear 
Convolution. 


function [d,r,c,Q,Qt] = A_data(n) 

% A : A matrix in bilinear form for cyclotomic 
convolution 

% d : Linear convolution modules used 

% Yr : Yrows 

% C : columns 

%Q : Q(1) = cost associated with D(d(i)) 

% Qt : Qt(1) cost associated with D(d(1i))' 
if n == 2, d [1]; 

elseif n == 4, d [2]; 

elseif n == 8, d [2-2]; 


elseif n == 16, d = [2 2 2]; 
elseif n == 3, d = [2]; 
elseif n == 9, d = [2 3]; 
elseif n == 27, d = [2 3 3]; 
elseif n == 5, d = [2 2]; 
elseif n == 7, d = [2 3]; 
end 


r=[], c= [], Q=[]; Qt = [1], 
for k = 1:length(d) 
[rk, ck, Qk, Qtk] = D_data(d(k)); 
r = [r rk]; c = [ce ck]; Q = [Q Qk]; Qt = 
Qtk]; 
end 


function [r,c,Q,Qt] = D_data(d); 

% D : D matrix in bilinear form for linear 
convolution 

% Y : rows 

% C : columns 

% Q : cost associated with D(d) 

% oF cost associated with D(d)' 


if d ==-1, r=1; c=1; Q=90; Qt = 0; 
elseif d == 2, r= 3; c=2; Q=1; Qt = 2; 
elseif d == 3, r=5; c=3; Q=7; Qt = 9; 


end 


function [f,r,c] C_data(p,e) 
% f : length of linear convolution 
% rows 

: columns 
prod((p-1).*(p.4(e-1))); 
Euler Totient Function) 

2° Fels 

F_data(f); 


a h & 
Woudouwhot. 


function c = F_data(n) 
% c : columns of F matrix 


[Qt 


if n==-1, c=1; 
elseif n == 2, Cc = 3; 
elseif n == 4, c = 9; 
elseif n == 8, C = 27; 
elseif n == 3, c = 5; 
elseif n == 6, c = 15; 
elseif n == 18, c = 75; 
end 


Programs for Inverse Transpose Reduction Operations 


function x = itKRED(P,E,K, x) 
% X = ItKRED(P,E,K,xX); 

% (inverse transpose) 

% P : P = [P(1),...,P(K)]; 

% E : E = [E(K),...,E(K)]; 


for i 1:K 
a = prod(P(1:i-1).%E(1:1i-1)); 
C = prod(P(iti:K).4E(i+1:K)); 
p = P(i); 
e = E(i); 
for j = e-1:-1:0 


X(1:a*c*(p4(j+1))) = LtRED(p,a,c* 
(p%j),x(Lia*c*(pr(j+1)))); 
end 
end 


function y = itRED(p,a,c, x) 
% Y = itRED(p,a,c,x); 
% (inverse transpose) 
y = zeros(a*c*p,1); 
for 1 = O:c:(a-1)*c 
for j = O:c-1 
A = x(i*p+j+1); 
for k = O:c:c*(p-2) 
A =A + x(i*ptj+kt+ct+1); 


end 
y(itj+1) = A; 
for k = O:c:c*(p-2) 
y(i*(p-1)+jtkta*c+1) = p*x(i*ptjt+k+1) 


A; 
end 
end 
end 
y = y/p; 


Programs for Permutations 


The permutation of Equation 18 from Preliminaries is implemented by pfp 
. It calls the function pfp2I . The transpose is implemented by pfpt and 
it calls pfpt21 . 


function x = pfp(n,K, x) 

% X = P(n(1),...,N(K)) * x 

%n = [n(1),...,n(K)]; 

% Length(x) = prod(n(1),...,n(K)) 
a = prod(n); 

s=1; 

for = 
a / n(il); 
pfp2t(a,n(i),s,x); 
Ss. nC)? 


x< 
WoW oth oR 


n 


end 


function y = pfp2I(a,b,s,x) 
% y = kron(P(a,b),I(s)) * x 
% length(x) = a*b*s 


n=a * b; 

y = zer os(n* s,1); 
ki = 0; 

k2 = 0; 

for k = O:n-1 


i1 = s * (k1 + b * k2); 


i2 = s * k; 
for 1 =1:s 
y(i1 + 1) = x(1i2 + 1); 
end 
ki = ki + 1; 
k2 = k2 + 1; 


if ki >= b 
ki = ki - b; 
end 
if k2 >=a 
k2 = k2 - a; 
end 
end 


function x = pfpt(n,K,x) 

% xX = P(n(1),...,N(K))' * x 

% (tanspose) 

%n = [n(1),...,nN(K)]; 

% length(x) = prod(n(1),...,n(K)) 


% a = prod(n); 

a= n(1); 

S = prod(n(2:K)); 

for 1 = 2:K 
s=s /n(i); 
xX = pfpt2iI(a,n(1i),s,x); 
a=a* n(1); 

end 


function y = pfpt2I(a,b,s,x) 
% y = P(a,b)' kron I(s) * x; 
% (transpose) 

% length(x) = a*b*s 


n=a* b; 

y = zeros(n*s,1); 
k1 = 0; 

k2 = 0 


it 
12 
for 1 = 1:s 

y(i2 + 1) = x(ii + 1); 
end 
ki = ki + 41; 
K2 = K2 4 


Wool 
nW 
+ * 


1f Ki -S= bD 
ki = ki - b; 
end 
if k2 >=a 
k2 = k2 - a; 
end 
end 


The following Matlab programs implement Rader's permutation and its 
transpose. They require the primitive root to be passed to them as an 
argument. 


function y = rp(p,r,x) 
% Rader's Permutation 


% p prime 
% a primitive root of p 
% X : Length(x) == p 
a=: ay 
y = zeros(p,1); 
y(1) = x(1); 
for k = 2:p 
y(k) = x(atl); 
a = rem(a*r,p); 
end 


function y = rpt(p,r,x) 

% Rader's Permutation 

% (transpose) 

% p : prime 

% Y : a primitive root of p 


% X : Length(x) == 


ar at As 
y = zeros(p,1); 
y(1) = x(1); 
for k = 2:p 


y(at1) = x(k); 
a = rem(a*r,p); 
end 


function [R, R_inv] = primitive_root(N) 


% function [R, R_inv] = primitive_root(N) 

% Ivan Selesnick 

% N is assumed to be prime. This function 
returns R, 

% the smallest primitive root of N, and R_inv, 
the 

% inverse of R modulo N. 


"Not Found'; 
O:(N-2); 
r xX = 1:(N-1) 
if ( 1:(N-1) == sort(rem2(x,m,N)) ) 
R= xX; 
break 
end 


R 
m 
Fo 


"Not Found'; 

1:N 

if rem(x*R,N) == 1 
R inv = x; 
break 


